Health checks: What? When? How?

This article surveys various health checks in Google Cloud. If you want to learn more, leave your preferences in the feedback form.

Generally speaking, a health check is a function or method that reports the general state (the health) of the underlying service. Some products narrow the definition of “general state” to something specific, such as the ability of the service to respond to requests.
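To make the definition concrete, here is a minimal sketch of a health check exposed over HTTP, using only the Python standard library. The `/healthz` path, the port, and the dependency names are assumptions chosen for illustration; they are not prescribed by any Google Cloud product.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_health(dependencies):
    """Return (http_status, body) summarizing the state of each dependency.

    `dependencies` maps a name to a zero-argument callable that returns True
    when that dependency is usable. The names and callables are hypothetical
    placeholders for whatever your service relies on.
    """
    results = {}
    for name, probe in dependencies.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as unhealthy instead of crashing the check.
            results[name] = False
    healthy = all(results.values())
    status = 200 if healthy else 503
    return status, {"status": "ok" if healthy else "unhealthy", "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    # Set this class attribute before starting the server.
    dependencies = {}

    def do_GET(self):
        if self.path != "/healthz":  # assumed path, pick whatever suits you
            self.send_error(404)
            return
        status, body = check_health(self.dependencies)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the health endpoint; blocks until the process is interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` after assigning `HealthHandler.dependencies` starts the endpoint. Returning 503 when any dependency fails is one possible design choice; a service might instead report only its own ability to answer requests.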

Health checks are an important instrument of service observability. When provided as a tool or service, they replace the work of coding your own metrics ingestion, collection, and analysis. Health checks often come integrated with alerting or incident response solutions. In many scenarios, health checks alone can be sufficient to raise product reliability to the desired level.

Google Cloud provides built-in health check services that help software developers and DevOps engineers speed up development and simplify product observability.

Different kinds of health checks in Google Cloud

Different types of health checks in Google Cloud let users solve different tasks, including automatic capacity draining and load balancing, auto-healing, managing Kubernetes and cloud-native workloads, and alerting on reliability thresholds.

  • Health check policies for Google Cloud load balancers (GCLB) and managed instance groups (MIGs) are the longest-established health check services in Google Cloud. A policy configures software tasks, often referred to as probes, that Google Cloud executes periodically against the backend services of load balancers or the VMs in MIGs. Each probe determines the status of its target by sending a request that uses the configured protocol and any additional parameters in the configuration. To let the probes run, you must configure the network to open the port(s) that the policy defines on the backends or VMs for access from the probe IP ranges. The policies support a variety of network protocols, including gRPC, TCP, SSL, HTTP, HTTPS, and HTTP/2. For one type of GCLB, the external passthrough network load balancers, the list of protocols is limited to HTTP and HTTPS only.
  • Kubernetes health checks are natively supported by the GKE and Cloud Run service offerings. If you have them configured in your on-premises environment, they will work the same way in Google Cloud. Unless your cluster has a sophisticated network topology, no network configuration adjustments are required to run the health check probes.
  • Uptime checks let you instruct Google Cloud to periodically query an application that responds to HTTP, HTTPS, or TCP requests. These checks mimic the end user of an application and verify resource availability on an ongoing basis. Uptime monitoring is especially valuable when traffic is low because of the time of day or seasonality, for example at night or during holidays. Uptime checks can test both public and private endpoints, and they can validate the response data.
  • For application owners looking to monitor their critical user journeys, Google Cloud recently announced synthetic monitors, a capability that uses automated tests to simulate user interactions with your application. This lets you monitor the availability, consistency, and performance of your applications and key business workflows from the perspective of a real user, on a continuous basis. To create a synthetic monitor, you start with a framework provided by Cloud Monitoring (custom or Mocha) and then write your tests. You can use Gemini Code Assist to generate the test code for your synthetic monitor.
  • Broken-link checkers are a special case of synthetic monitors. They are implemented using the same framework that Cloud Monitoring provides for synthetic monitors, and they periodically test a URI and a configurable number of links found at that URI.
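The idea behind the last item in the list, testing a page and a capped number of the links found on it, can be sketched in a few lines. This is a conceptual illustration in Python using only the standard library; it is not the framework that Cloud Monitoring provides, and all function names here are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.error
import urllib.request


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html, max_links):
    """Return up to max_links distinct absolute http(s) links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        if len(links) >= max_links:
            break
        absolute = urljoin(base_url, href)
        # Skip mailto:, javascript:, and similar non-HTTP schemes.
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


def http_status(url, timeout=10.0):
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def find_broken_links(base_url, html, max_links=10, fetch_status=http_status):
    """Return (link, status) pairs for links that fail or answer with 4xx/5xx."""
    broken = []
    for link in extract_links(base_url, html, max_links):
        status = fetch_status(link)
        if status is None or status >= 400:
            broken.append((link, status))
    return broken
```

Passing `fetch_status` as a parameter keeps the link extraction testable without network access. Some servers reject HEAD requests, so a real checker may need to fall back to GET.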

Except for Kubernetes health checks, all health check types are managed by Google Cloud. The following reference lists the Google APIs and Terraform resources for provisioning and controlling these managed health checks. You can also define them manually in the Cloud console.

Health checks            | Google API reference                              | Terraform resource
-------------------------|---------------------------------------------------|------------------------------------------------------
Health check policy      | APIs                                              | google_compute_health_check
Kubernetes health checks | not a resource                                    | not a resource
Uptime checks            | APIs                                              | google_monitoring_uptime_check_config
Synthetic monitors       | Use uptime checks API with SyntheticMonitorTarget | special case of google_monitoring_uptime_check_config
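As an illustration of what a health check policy definition contains, the sketch below builds a request body for an HTTP health check as a plain Python dictionary. The field names reflect my understanding of the Compute Engine health check resource and should be verified against the API reference before use. The function name, the policy name, the port, and the `/healthz` path are hypothetical.

```python
import json


def http_health_check_body(name, port, request_path,
                           interval_sec=5, timeout_sec=5,
                           healthy_threshold=2, unhealthy_threshold=2,
                           enable_logging=False):
    """Build a dictionary describing an HTTP health check policy.

    Field names are assumed to match the Compute Engine health check
    resource; confirm them in the API reference.
    """
    if timeout_sec > interval_sec:
        # A probe should finish before the next one is due.
        raise ValueError("timeout_sec must not exceed interval_sec")
    return {
        "name": name,
        "type": "HTTP",
        "checkIntervalSec": interval_sec,
        "timeoutSec": timeout_sec,
        "healthyThreshold": healthy_threshold,
        "unhealthyThreshold": unhealthy_threshold,
        "httpHealthCheck": {"port": port, "requestPath": request_path},
        # Logging is off unless requested, because log volume is billed.
        "logConfig": {"enable": enable_logging},
    }


def to_json(body):
    """Serialize the body with stable key order for review or storage."""
    return json.dumps(body, indent=2, sort_keys=True)
```

For example, `to_json(http_health_check_body("web-health-check", 8080, "/healthz"))` produces a document you could review before submitting it through the API, Terraform, or the Cloud console.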

Which health check to use

As you can see, there are many different health check types. Which health check do you need? Should you use more than one simultaneously? These questions are not easy to answer.

First, let’s look at which health check is best suited to which job.

The GCLB and MIG health check policies do a great job of improving load balancing (using capacity draining) and auto-repair at a very low investment cost, because you do not need to do anything besides configuring the policies. Your workload only has to respond on some endpoint, and it is easy to configure the policy to probe that endpoint. The health check policies are especially useful with legacy applications that are load balanced and scale horizontally using MIGs, and whose troubleshooting playbooks often start with a “restart the server” instruction. Once the health check policies are configured, GCLB and MIG do all the work of auto-repair and capacity control for you. Additionally, you can use the health check logs to generate log-based metrics or to define alerts. Be mindful that you are billed for the volume of the logs; for this reason, health check logging is off until you explicitly enable it.
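One way to picture how a run of probe results turns into a healthy or unhealthy verdict is a small state machine with consecutive-result thresholds. The sketch below is only an illustration of that idea: the class name and defaults are hypothetical, and it does not claim to reproduce how Google Cloud implements its probers.

```python
class HealthTracker:
    """Flip between healthy and unhealthy only after a streak of contrary probes."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=2,
                 initially_healthy=False):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = initially_healthy
        # Number of consecutive probe results that contradict the current state.
        self._streak = 0

    def record(self, probe_succeeded):
        """Record one probe result and return the resulting health state."""
        if probe_succeeded == self.healthy:
            # The result agrees with the current state, so any streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_succeeded else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_succeeded
            self._streak = 0
        return self.healthy
```

Requiring several consecutive failures before declaring a backend unhealthy avoids draining capacity or recreating a VM because of a single dropped probe, at the cost of reacting a few probe intervals later.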

Kubernetes health checks are a natural solution for Kubernetes workloads. They are also the recommended solution for Kubernetes workloads that run on Cloud Run. For workloads that run on GKE and are exposed through Kubernetes services of the load balancer type, you can use GCLB health checks in addition to the native Kubernetes liveness and readiness probes. However, this approach is not recommended: it increases the complexity of the setup and risks introducing unexpected conflicts between Kubernetes container-based load balancing and the load balancing strategy of GCLB.
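The difference between the liveness and readiness probes mentioned above is easiest to see from the application side. In Kubernetes, a failing liveness probe leads to a container restart, while a failing readiness probe only takes the pod out of service traffic. The sketch below shows one way an application might answer the two probes; the state fields and the `/livez` and `/readyz` paths are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical application state consulted by the probe handlers."""
    warmed_up: bool = False
    database_connected: bool = False
    worker_stuck: bool = False


def liveness(state):
    # Fail only when a restart is the right remedy, for example a stuck worker.
    if state.worker_stuck:
        return 500, "stuck"
    return 200, "alive"


def readiness(state):
    # Fail whenever the app should not receive traffic yet, without restarting it.
    if state.warmed_up and state.database_connected and not state.worker_stuck:
        return 200, "ready"
    return 503, "not ready"


PROBE_ROUTES = {"/livez": liveness, "/readyz": readiness}


def handle_probe(path, state):
    """Dispatch a probe request path to the matching handler."""
    handler = PROBE_ROUTES.get(path)
    if handler is None:
        return 404, "unknown probe"
    return handler(state)
```

Keeping the liveness check narrower than the readiness check is a deliberate choice here: a slow database should stop traffic to the pod, but restarting the container would not fix it.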

Uptime checks and synthetic monitors (including broken-link checkers) can be used with any type of workload and in parallel with other types of health checks. They prioritize the customer experience by monitoring business workflows end-to-end, which makes them best suited for SRE workflows. The main difference between health check policies and uptime checks is that uptime checks issue requests from multiple locations throughout the world to publicly available URLs or Google Cloud resources, while health check policies probe the backend directly. Uptime checks can also issue requests to URLs or Google Cloud resources that are exposed only on a Virtual Private Cloud (VPC) network; these are called private uptime checks. You can then use Monitoring dashboards and alerts to track uptime check metrics and trigger events or notifications according to your Ops policies.

Synthetic monitors let you introduce more sophisticated reliability techniques. At their core, synthetic monitors are Cloud Functions that are written in Node.js and rely on the open source Synthetics SDK framework. For example, you can implement true latency monitoring by writing a synthetic client that issues requests from multiple locations throughout the world and measures the response times of your service.
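The latency monitoring idea can be outlined independently of any framework. The following is a conceptual sketch in Python, not a deployable synthetic monitor (those are written in Node.js with the Synthetics SDK). The function names, the number of attempts, and the 95th percentile budget are all assumptions for illustration.

```python
import math
import time


def measure_latencies(send_request, attempts=5, clock=time.perf_counter):
    """Call send_request repeatedly; return (latencies in ms, failure count).

    `send_request` is a hypothetical zero-argument callable that performs one
    request against your service and raises on failure.
    """
    samples = []
    failures = 0
    for _ in range(attempts):
        started = clock()
        try:
            send_request()
        except Exception:
            failures += 1
            continue
        samples.append((clock() - started) * 1000.0)
    return samples, failures


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def latency_ok(samples, failures, p95_budget_ms, max_failures=0):
    """Pass only if failures stay within budget and p95 latency meets the target."""
    if failures > max_failures or not samples:
        return False
    return percentile(samples, 95) <= p95_budget_ms
```

Running the same measurement from several regions and comparing the results per location is what turns this from a simple timer into the kind of end-user latency signal described above.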

Additional reading