Publications

Selected publications and articles on SRE, reliability, and cloud systems.

Building Reliable Services on the Cloud
March 1, 2022
Authors: Steve McGhee, Phillip Tischler, Shylaja Nukala
Publisher: O'Reilly Media
Explores systematic resilience for sustained innovation, focusing on designing reliability into cloud-native architectures from the start using SLOs and operational health strategies.

Enterprise Roadmap to SRE
January 1, 2022
Authors: Steve McGhee, James Brookbank
Publisher: O'Reilly Media
A practical guide for large organizations to bridge the gap between initial enthusiasm for SRE and its actual implementation. Focuses on mapping SRE principles to existing enterprise structures.

Finding a Problem at the Bottom of the Google Stack
March 12, 2020
Author: Steve McGhee
Publisher: Google Cloud Blog
A deep dive into a 'million-to-one' failure where broken rack wheels caused server overheating, illustrating the SRE principle that 'all incidents should be novel.'