About Me

Site Reliability Engineer & Reliability Advocate

I’m Steve McGhee, a veteran Site Reliability Engineer (SRE) and Reliability Advocate at Google. I have over 20 years of industry experience, including more than a decade at Google building and maintaining some of the most reliable systems on the internet, such as Android, YouTube, and Google Cloud.

Professional Journey

My journey in SRE began in the early days of the discipline. I was part of the original team that wrote the monitoring for the first-ever Android launch. Since then, I’ve focused on bridging the gap between theoretical reliability concepts and practical engineering implementation.

After a brief “summer vacation” working on cloud migration and architecture at a smaller enterprise, I rejoined Google to help external organizations adopt SRE principles. I currently serve on the Google Cloud Incident Response Core Team, helping our largest customers architect for high availability and sustainable operations.

Key Projects

  • r9y.dev: An open-source project and community hub for reliability engineering concepts, including the “Reliability Map.”
  • The Prodcast: Primary host for Season 3 of Google’s official SRE podcast, focusing on “Champions of the Internet.”
  • Publications: Co-author of O’Reilly reports including “Enterprise Roadmap to SRE” and “Building Reliable Services on the Cloud.”

Interests

  • Site Reliability Engineering: Scaling cultures and systems.
  • Public Speaking: Sharing reliability lessons at conferences like SREcon, KubeCon, and SLOconf.
  • Mental Models: Developing better ways to reason about complex distributed systems.

Resume

You can download my latest resume here.
