SRE

Health checks: What? When? How?

This article surveys the various health checks available in Google Cloud. If you want to learn more, leave your preferences in the feedback form.

Generally speaking, a health check is a function or method that reports the general state (a.k.a. health) of the underlying service. Some products narrow the definition of “general state” to something specific, such as the ability of the service to respond to requests.
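To make the definition concrete, here is a minimal sketch of a health check in Python. It is not taken from any Google Cloud product: the function name `health_check`, the probe names and the response shape are all assumptions chosen for illustration. The function runs a set of dependency probes and reports the overall state as an HTTP-style status code with a JSON body.

```python
import json
from typing import Callable, Dict, Tuple


def health_check(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every probe and summarize the service state.

    `probes` maps a hypothetical dependency name (e.g. "db") to a callable
    that returns True when that dependency is usable.
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises is treated as a failed check.
            results[name] = False

    healthy = all(results.values())
    body = json.dumps(
        {"status": "ok" if healthy else "unhealthy", "checks": results},
        sort_keys=True,
    )
    # 200 signals "able to serve requests", 503 signals "not ready".
    return (200 if healthy else 503), body
```

A web framework handler could return this tuple from an endpoint such as `/healthz` (again, a hypothetical path), so that an external prober only needs to look at the status code.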

Health checks are an important instrument of service observability. When provided as a tool or a service, they save you from developing your own metrics ingestion, collection and analysis. Health checks often come integrated with alerting or incident response solutions. In many scenarios, health checks alone can be sufficient to raise product reliability to the desired level.

Define Google Cloud Managed Service for Monitoring

You may have seen this notice when opening the SLOs Overview in Cloud Console.

[image: the notice displayed on the SLOs Overview page]

This notice announces a recent change in the way services are defined for Cloud Monitoring. Before the change, Cloud Monitoring automatically discovered services that were provisioned in App Engine, Cloud Run or GKE, and populated them in the Services Overview dashboard. After the change, every service in the Services Overview dashboard has to be created explicitly. To simplify this task, when you define a new service in the UI you are presented with a list of candidates built from the auto-discovered services. The full list of auto-discovered services includes managed services from App Engine, Cloud Run and Istio, as well as GKE workloads and services. Besides the UI, you can add managed services to Cloud Monitoring using the services.create API or the Terraform google_monitoring_service resource.
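As an illustration of the API route, the sketch below builds (but does not send) a services.create request for a Cloud Run service. It assumes the v3 REST shape of the Cloud Monitoring API, where the request is a POST to `projects/PROJECT_ID/services` with a `serviceId` query parameter and a body carrying `displayName` and a `cloudRun` block; verify the field names against the current API reference before relying on them. The helper name `build_create_service_request` and all sample values are hypothetical.

```python
import json
from typing import Tuple
from urllib.parse import quote, urlencode

# Assumed base URL of the Cloud Monitoring v3 REST API.
MONITORING_API = "https://monitoring.googleapis.com/v3"


def build_create_service_request(
    project_id: str,
    service_id: str,
    display_name: str,
    cloud_run_service: str,
    location: str,
) -> Tuple[str, str]:
    """Return the (url, json_body) pair for a services.create call.

    The request still has to be sent as an authenticated POST; sending and
    authentication are intentionally left out of this sketch.
    """
    url = (
        f"{MONITORING_API}/projects/{quote(project_id)}/services?"
        + urlencode({"serviceId": service_id})
    )
    body = {
        "displayName": display_name,
        # Assumed field names for a Cloud Run managed service.
        "cloudRun": {"serviceName": cloud_run_service, "location": location},
    }
    return url, json.dumps(body, sort_keys=True)
```

For example, `build_create_service_request("my-project", "checkout-svc", "Checkout", "checkout", "us-central1")` yields a URL and a body that can be passed to any HTTP client together with an OAuth access token.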

Google Cloud SLO demystified: Uncovering metrics behind predefined SLOs

Google Cloud supports service monitoring by defining and tracking the SLOs of services based on the metrics they ingest into Google Cloud. This support greatly simplifies implementing SRE practices for services that are deployed to Google Cloud or that store telemetry data there. To make this even simpler for developers, service monitoring can automatically detect many types of managed services and supports predefined availability and latency SLI definitions for them.
When you define a new SLO, you are prompted to select a predefined SLI or to define your own.
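To show the arithmetic that sits behind availability and latency SLIs, here is a small sketch. It assumes the common request-based reading: availability is the fraction of successful requests, and latency is the fraction of requests served within a threshold. It does not reproduce the exact metrics that Cloud Monitoring uses for its predefined SLIs, and the function names and the "no traffic counts as compliant" rule are assumptions.

```python
from typing import Sequence


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # Assumption: a window without traffic counts as compliant.
    return good_requests / total_requests


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests served within `threshold_ms`."""
    if not latencies_ms:
        return 1.0  # Same assumption as above for an empty window.
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli: float, goal: float) -> bool:
    """An SLO is met when the measured SLI reaches the goal (e.g. 0.999)."""
    return sli >= goal
```

With 999 successful requests out of 1000, `availability_sli` returns 0.999, which just meets a 99.9% goal; one more failed request in the same window would miss it.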