Slight Reliability
139 Episodes
This week I sit down and have a discussion with Amin Astaneh (from Certo Modo) about CI/CD. We cover the power of the standard change as a way to navigate ITIL while still implementing DevOps practices, what to monitor to make your CI/CD observable, single piece flow, testing in production, and so much more. You can find Amin on his company website https://certomodo.io, LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitter: https://twitter.com/aastaneh You can find t...
"Environment issues are just incidents that happened to occur in a non-production environment"... so why do we treat them so differently? In this first episode of the 2024 season I reflect on how we handle incidents in non-prod environments. (Note: Had a few issues with noise suppression in OBS Studio cutting off the start of some words, will sort it for the next episode) You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitt...
How could AI help human beings negotiate the mountains of telemetry we collect to get simple and fast insight? This week I'm joined by Ottermon AI CEO and founder Checo Pacheco to talk about the lifecycle of observability coverage and tooling within organisations and how AI is helping to find signals amongst the noise and reduce cognitive load for SREs. We discuss... The need for a layer of logic on top of our telemetry data; The observability lifecycle of a DevOps team; How most o...
What is chaos engineering and how is it being used in 2025? This week I'm joined by Gremlin CEO and founder Kolton Andrus to discuss... What is chaos engineering and what are its origins? How has it evolved over the years? The role of AI agents in SRE work; Justifying the value of chaos engineering; How do I get started? ...and much more. You can find Kolton on: LinkedIn: https://www.linkedin.com/in/kolton-andrus-77315a2/ And you can find out more about Gremlin's n...
What are Team Topologies? How can they be used to deliver value more simply and effectively (and in a more humane way)? This week I'm joined by Luke McManus to discuss... What are the four team topologies? Can we have too much collaboration? Team interaction models; Cognitive load; Value dynamics mapping ...and much more. You can find Luke on: LinkedIn: https://www.linkedin.com/in/luke-mcmanus-agile/ Check out the recently released second edition of the Team Top...
How do you begin contributing to an open source project? What's it like? What do you get out of it? This week I'm joined by Wendy Ha who shares her unique story of joining the Kubernetes project and becoming a contributor. We explore... What it's like working on one of the biggest open source projects in the world; The benefits of contributing to open source; How much time and effort does it take? The unique challenges of contributing from APAC (and the need for more con...
As an #SRE how do you influence senior leadership to get support and priority for the things you care about? To answer this question I'm joined by Nora Jones, founder of Jeli and now Head of Pricing, Product Strategy and Growth at PagerDuty. Our conversation touches on... How understanding needs to flow both ways (between engineers and leaders); Reliability is as much an art as a science; Using napkin math to start conversations; Understand the system (your org) before try...
This week I do a retrospective on the Slight Reliability podcast. How many people listen to it? How do I feel about the show? What's going well? What could be better? What's next for the show? If you want to check out the podcast that came before Slight Reliability, you can find Performance Time archived on YouTube here: https://www.youtube.com/@performance-time You can find Stephen on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Bluesky: https://bsky.app...
Have you burned out at work? What was your experience? How did you work through it? This week I'm joined by the incredible Colette Alexander to discuss what burnout is, what it means, and we both share our personal experiences burning out at work. We cover... What is burnout? Why does it happen? What are the symptoms? Fight, flight, or freeze; Advice on how to recover ...and much more. Resources from the show... Why you're so angry at work (and what to do about it) b...
This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover... The mobile/backend observability divide; The challenge of distributed tracing on mobile apps; The entire device runtime environment matters for your app; The quest for user-centric mobile observability; Advice on how to get started with mobile observability ...and much more. You can find Hanson on: LinkedIn: https://www.link...
This week I'm joined once more by SRE leader Michelle Casey who gives a broad and shallow introduction to resilience engineering. We cover... Reliability VS Robustness VS Resilience; What is a complex system? Safety one/safety two; Mental models; Human error ...and so much more. Resources from this episode: Four concepts for resilience (paper) by Dr. David Woods https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implication...
This week on the 100th episode I'm joined by DevOps and Resilience Engineering legend John Allspaw to talk about learning (especially from incidents). We discuss... Classroom VS situated learning; The myth of the perfect handover; ITIL as a coping strategy to try and make sense of the organic, wild, and messy; How you cannot incentivise to avoid incidents (it doesn't work that way); You can't understand how something is broken unless you know how it's supposed to work i...
This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, and beyond. We discuss... Empowering engineers to implement change in your org; Focusing on what matters (customer & business > technology); Not just adding more monitoring as the output of each PIR; How autonomy can lead to...
This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore... Is the root cause of every incident the big bang? How the value of root cause degrades as complexity increases; That if the culture is not blameless, people will hide things; Alternative approaches to root cause analysis such as branching timelines; Getting someone without skin in the ga...
This week I'm joined by David Dick from 2 Steps to (finally!) discuss synthetic monitoring. We cover... What is synthetic monitoring? What are the benefits and drawbacks to using it? Non-web based synthetics (the tough stuff); Combining RUM and synthetics; Do synthetics need an OTEL-like framework? ...and much more. You can find David on: LinkedIn: https://www.linkedin.com/in/david-dick/ You can find more about 2 Steps at https://2steps.io/# You can find Stephen on: ...
This week I'm joined by Cin7 Engineering Director Milan Brown to unpack the challenges of technology management and leadership. We discuss... Theory X vs Theory Y management; Intention based leadership and communication; Conditions in an org for people to thrive; How do you learn to manage and lead? Managing people when you're not an expert in what they do ...and much more. Resources mentioned during the episode: Turn The Ship Around! (book): https://davidmarquet.com...
This week Leon Adato and I break down the state of applying for roles in tech. We cover... What a resume or CV is and is not; Leveraging your connections rather than relying on applying cold; How most job descriptions are works of fiction; White-fonting to game AI resume assessment; Experimental ways we could recruit ...and our pitch for Kubernetes the Rock Opera (and much more) You can find Leon's job postings weekly on his website: https://www.adatosystems.com/category/...
This week Priyam Kumar shares his story of moving from a massive organisation to a startup and the challenges and growth that came from that. We discuss... War stories and examples of production incidents; The "hacks" we build to keep things running (and how maybe that's just normal); Keeping it simple... YAGNI (You Ain't Gonna Need It!); The perils of getting stuck in reactive mode; Areas of learning if you want to get into SRE ...and much much more. You can find Priy...
This week Michelle Casey shares her insights as a 'head of' engineering manager in the SRE context. This was one of my favourite conversations on the podcast so far. We cover topics such as... Why move into leadership? Learning from other leaders; What is unique about SRE leadership? Women in engineering leadership ...and we go through some feedback I got as a leader recently. Resources that Michelle mentions during the episode: The Five Dysfunctions of a Team (book): ...
This week Adam and I get philosophical about what constitutes maturity in the field of observability. We tackle questions such as... Does your org treat observability as a cost centre or a value add? Are you using observability reactively to solve problems? Or proactively to build better products and services? Is your observability connected to your users and business in a meaningful way? Is monitoring the social media sentiment of your product part of observability? .....