Reliable systems, good alerting and monitoring, and luck. 3 of the ingredients to making on-call not suck quite so much. Plus being paid for it, of course! @heyitsols on ‘Being Paid To Sleep’ at @DevOpsDaysLDN
@basti_tee A product or system designed without ongoing operational concerns will *appear* cheaper but likely end up costing significantly more in the long term compared to a system co-designed with the people who will actually operate it.
2/
"Software operability still suffers because #Devs are no closer to actually running the software that they build, and the SREs still don't have time to engage with Devs to fix problems when they arise."
WARNING! You'll need PROPER observability. If you don't have proper introspection, the technical challenges will be almost impossible to overcome. In the mentioned example we were able to trace the annotated db queries through the complete code artifact.
https://t.co/C17hOHkyau
Can we build #observable services without logs? @glenathan, Senior Software Engineer @Geckoboard shares at #QConPlus a story of how it went: https://t.co/VFOhHByps3
💡Missed this talk? You can still register & have on-demand access to all the talks for 3 months.
#Observability
@onejasonknight @bandanjot @AbsorbingStates That said, what's more important than "project versus product" is that the funding and execution model for the software evolution keeps that software viable in the long term.
Prefer a sustainable pace and attention to operability rather than a feature factory.
👍
@PierreVincent @glenathan This is so good to hear! 🙌
Weirdly, I was using (and wrote about) this kind of approach almost exactly 10 years ago:
- Dynamic severity levels
- Unique IDs
- A focus on events
#operability
https://t.co/7KCOIfbrZx
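A minimal Python sketch of how those three ideas might fit together. This is an illustration only, not the approach from the linked post: the event names, the ID format, and the `severity` table are all hypothetical.

```python
import json
import logging
from enum import Enum


class Event(Enum):
    # Hypothetical events; each carries a unique, searchable ID.
    ORDER_PLACED = "ORD-0001"
    PAYMENT_DECLINED = "PAY-0002"


# Severity is kept in data rather than at the call site, so it can be
# changed at runtime (for example from configuration) without a code change.
severity = {
    Event.ORDER_PLACED: logging.INFO,
    Event.PAYMENT_DECLINED: logging.WARNING,
}


def format_event(event, **fields):
    """Render an event as one JSON line that includes its unique ID."""
    return json.dumps({"event_id": event.value, "event": event.name, **fields})


def emit(logger, event, **fields):
    """Log the event at whatever severity is currently configured for it."""
    logger.log(severity.get(event, logging.INFO), format_event(event, **fields))
```

With this shape, raising `PAYMENT_DECLINED` to `logging.ERROR` during an incident is a one-line change to the `severity` table, and the `event_id` stays stable for searching and alerting.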
"If we have proper visualisation and better metaphors, we set much better conditions for our operators to be comfortable in understanding and responding to variations in our systems." @yurynino #QConLondon
Our latest version of the Multi-Team Software Delivery Assessment deck by @matthewpskelton @ConfluxHQ is now on sale! Including the additional themes of #security, #teamtopologies, on-call and SRE & Reliability, this is our most comprehensive tool! https://t.co/MSjuEIo8l0
Nice example of a useful, in-app scheduled maintenance message from @MiroHQ 👏
"Upcoming scheduled maintenance: Saturday, March 26, 2022 at 5:30 AM (your time). Miro will be unavailable for 1 hour."
The "your time" bit is good #UX #operability #reliability
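A rough Python sketch of the "your time" idea: store the maintenance window in UTC and convert it to the viewer's timezone only when rendering. This is not Miro's implementation; the function name, wording, and parameters are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def maintenance_notice(start_utc: datetime, user_tz: tzinfo, hours: int) -> str:
    """Build a maintenance banner with the start time in the user's timezone."""
    local = start_utc.astimezone(user_tz)
    hour12 = local.hour % 12 or 12  # 12-hour clock without a leading zero
    unit = "hour" if hours == 1 else "hours"
    return (
        f"Upcoming scheduled maintenance: {local:%A, %B} {local.day}, "
        f"{local.year} at {hour12}:{local:%M} {local:%p} (your time). "
        f"The service will be unavailable for {hours} {unit}."
    )


# Hypothetical window stored in UTC, shown to a user one hour ahead of UTC.
start = datetime(2022, 3, 26, 5, 30, tzinfo=timezone.utc)
print(maintenance_notice(start, timezone(timedelta(hours=1)), 1))
```

In a real app `user_tz` would more likely be a named zone such as `zoneinfo.ZoneInfo("Europe/London")`, so daylight saving changes are handled by the timezone database rather than a fixed offset.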
LARGE SYSTEMS USUALLY OPERATE IN FAILURE MODE, via @dangolant
Or like I used to say, your distributed system exists in a continuous state of partial degradation. There are bugs and flakes and failures all the way down, and hardly any of them ever matter. Until they do.
@aleixmorgadas We at @ConfluxHQ have been doing a lot of work around reliability over the past 2 or 3 years.
See https://t.co/9QTdQGAoc1 for some ideas about exploring and measuring reliability. 👍
In 2021 and beyond, showing a generic "Oops. Something went wrong" 500 error page is not just user-hostile but exposes major flaws in the product engineering approach.
➡️ Design for the UX under error conditions.
#operability #UX
Team Guides for Software https://t.co/Rs1vWCXKPB by Matthew Skelton, Rob Thatcher, Alex Moore, Chris Young, Mattia Battiston, Ash Winter, Rob Meaney, Manuel Pais and Chris O'Dell is the featured bundle on the Leanpub homepage! https://t.co/7B8N7ZWvYT cc @matthewpskelton
I'm lucky to work with lots of great people here @weareglofox but it's so cool to see @clintonsweetnam
1. Design a system with #testability and #operability as a primary concern from the outset and then
2. Share how he and his team did it with the whole Engineering Department😍