Talks and Videos from Steve

Here you’ll find a collection of video presentations I’ve given at various conferences and meetups.

Google Cloud Next 2024 = Developer Keynote (UNCUT)

This video is the uncut developer keynote from Google Cloud Next 2024.

Reliable Systems through Platform Engineering

Infrastructure breaks, but systems can persist! In this talk, Steve presents concepts like spanning failure domains and using generic mitigations through Platform Engineering, as well as introducing a lab environment where teams can experiment with these capabilities directly.

Infrastructure as code

This talk provides a comprehensive introduction to Infrastructure as Code (IaC), covering fundamental concepts, best practices, and practical approaches to getting started with managing infrastructure programmatically.

Achieving DORA outcomes by building reliability | Steve McGhee | SREday San Francisco 2024

This video discusses achieving DORA (DevOps Research and Assessment) outcomes by building reliability.

Reliability Theory and Practice

In this presentation for GDG Cloud Southlake, Steve McGhee explores how organizations can move beyond the 'myth' of 100% uptime to a more controlled, engineering-based approach to availability. Key themes include Reliability as a Choice, Risk and Probability, Control Over Maximization, and Practical Strategies like Canary releases and symptom-based alerting.

Archetypes for Reliable Systems

In this presentation, Steve McGhee and Ameer Abbas introduce a comprehensive model for designing and operating reliable cloud services, exploring various system archetypes and their implications for reliability.

Panel Discussion: Modern Monitoring and Observability

This video is a panel discussion about Modern Monitoring and Observability, featuring Jason Hand of Datadog, Ernest Mueller of Accenture, Steve McGhee of Google, and Peco Karayanev of PagerDuty. The panel is hosted by PagerDuty DevOps Advocate Mandi Walls.

Achieving DORA outcomes via Reliability Engineering in your Platform

This video discusses achieving DORA (DevOps Research and Assessment) outcomes through the application of Reliability Engineering principles within your platform.

Safe serverless deployments with Cloud Run

This presentation covers best practices and strategies for implementing safe deployment patterns in serverless environments using Cloud Run, ensuring reliability and minimizing risk during deployments.

Two Paths in the Woods

Steve from Google and Sal from Nobl9 present two independently developed methods for teaching SRE topics to customers, which turn out to be quite similar. Steve presents the 'reliability map' and Sal shows Nobl9's SLODLC.

Build Reliable Applications with Kubernetes and Istio

Steve McGhee and Ameer Abbas explore the intersection of reliability engineering with Kubernetes and Istio, demonstrating how these technologies can be leveraged to build and maintain reliable applications at scale.

How to get started with SLI/SLO

This talk provides a comprehensive introduction to Service Level Indicators (SLIs) and Service Level Objectives (SLOs), drawing from Steve McGhee's experience implementing these practices at Google.

Rootly Humans of Reliability

In this segment with Rootly, Steve McGhee shares insights about his role in reliability engineering and discusses the importance of human factors in building and maintaining reliable systems.

Enterprise Roadmap for SRE

This presentation provides a detailed roadmap for implementing Site Reliability Engineering (SRE) practices in enterprise environments, covering strategies, challenges, and best practices for successful adoption.

Non-standard SLOs: beyond availability

This talk delves into the world of non-standard Service Level Objectives (SLOs), moving beyond traditional availability metrics to explore more nuanced ways of measuring and ensuring service reliability.

Observability Fundamentals: Open Standards

In this lightning talk, Steve McGhee covers the essential concepts of observability and the importance of open standards in building observable systems.

SRE in Enterprise

Steve McGhee and James Brookbank present on the 'Enterprise Roadmap for SRE' O'Reilly report, addressing the challenges enterprises face in adopting Site Reliability Engineering (SRE) and strategies to overcome them.

How VMs are the Matryoshka doll of compute

This video explores a cloud-native future, discussing how to determine the reliability of engineering patterns and identify which parts of your system require upgrades.

Putting SRE Principles in Platform Engineering

This talk explores how to apply Site Reliability Engineering (SRE) principles to platform engineering. It covers practical approaches to building reliable platforms and how SRE methodologies can improve platform operations and development.

Build a Software Delivery Platform with Anthos

This video explains how to use Anthos and GitLab to build a multi-team platform that streamlines CI/CD, policy management, and developer onboarding across your organization.