Observability

Control What You Log

DISCLAIMER: This post is not about log storage billing or managing log sinks.

Have you ever read or heard the phrase “Write everything to logs”? It is good advice: you never know what information will be useful, or when. It is also easy to do in Google Cloud. With the help of audit logs, all infrastructure, security, and other internal cloud events are stored in Cloud Logging, and you can write application logs by simply printing them to stdout. However, there are situations when you may need to prevent some log entries from being stored.
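One mechanism for this is a sink exclusion filter, written in the Cloud Logging query language; entries matching the filter are never stored. Below is a minimal sketch (the helper function name and the severity cutoff are illustrative, not from the post) of building such a filter expression, which you would then attach to a sink, e.g. with `gcloud logging sinks update _Default --add-exclusion=name=drop-noise,filter=...`:

```python
def build_exclusion_filter(resource_type: str, max_severity: str = "INFO") -> str:
    """Build a Cloud Logging filter matching entries to EXCLUDE:
    everything from the given resource type at or below max_severity."""
    return f'resource.type="{resource_type}" AND severity<={max_severity}'

if __name__ == "__main__":
    # Drop chatty DEBUG/INFO entries coming from Compute Engine instances.
    print(build_exclusion_filter("gce_instance"))
```

Note that excluded entries still count as received by the Log Router, but they are not ingested into the log bucket.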

How to Export Google Cloud Logs

Google Cloud provides efficient and inexpensive storage for application and infrastructure logs. Logs stored in Google Cloud can be queried and analyzed using the analytical power of BigQuery. However, there are scenarios when Google Cloud customers need to export log data to third-party (3P) solutions. This post reviews the two main use cases of log exporting: exporting logs that are already stored, and exporting logs while they are being ingested. The post focuses on how to configure and implement the part of the export process that extracts logs from Google Cloud. Loading the data into 3P solutions is not explored because of the variety of requirements and constraints that different 3P solutions impose.
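For the ingestion-time case, a Log Router sink can route matching entries to a Pub/Sub topic, where each message payload is a JSON-serialized LogEntry. As a sketch of the extraction side (the field selection here is illustrative, not the post's actual pipeline), a consumer might flatten each entry before handing it to a 3P loader:

```python
import json


def flatten_entry(payload: bytes) -> dict:
    """Extract the fields a downstream system typically needs from a
    JSON-serialized LogEntry delivered by a Pub/Sub export sink."""
    entry = json.loads(payload)
    return {
        "timestamp": entry.get("timestamp"),
        "severity": entry.get("severity", "DEFAULT"),
        "resource_type": entry.get("resource", {}).get("type"),
        # A LogEntry carries exactly one payload variant (text, JSON, or proto).
        "message": entry.get("textPayload") or entry.get("jsonPayload"),
    }
```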

Using PromQL in Google Cloud

PromQL stands for Prometheus Query Language. This post is about using PromQL in Cloud Monitoring, where it provides an alternative to the Metrics Explorer menu-driven builder and to Monitoring Query Language (MQL) for exploring metrics and creating charts and alerts. Google Cloud introduced support for PromQL at the same time as Managed Service for Prometheus; later, PromQL support was added to Monitoring alert management. Practically, this means you can use PromQL instead of MQL to query Cloud Monitoring metrics in the Metrics Explorer, in custom dashboard configurations, and in alert management.
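Outside the Cloud Console, PromQL queries can also be sent to Cloud Monitoring's Prometheus-compatible HTTP API. A small sketch, assuming the `/location/global/prometheus/api/v1/query` path documented for Managed Service for Prometheus (verify against the current docs), that builds the URL for an instant query:

```python
from urllib.parse import urlencode


def promql_query_url(project_id: str, promql: str) -> str:
    """Build the instant-query URL for Cloud Monitoring's
    Prometheus-compatible endpoint (assumed path; see GMP docs)."""
    base = (
        f"https://monitoring.googleapis.com/v1/projects/{project_id}"
        "/location/global/prometheus/api/v1/query"
    )
    return f"{base}?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # 'up' is a standard Prometheus metric; the project ID is a placeholder.
    print(promql_query_url("my-project", "up"))
```

Requests to this endpoint need an OAuth 2.0 bearer token, the same as other Google Cloud APIs.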

All the ways to scrape Prometheus metrics in Google Cloud

Production systems are monitored for reliability and performance tracking, at the very least. Monitored metrics ‒ sets of measurements related to specific attributes of the system being monitored ‒ are first captured in the system's executing code and then ingested into the monitoring backend. The choice of backend often dictates the method(s) of ingestion. If you run your workloads on Google Cloud and use a self-managed Prometheus server for metric collection, this post will help you reduce maintenance overhead and some billing costs by using Google Cloud Managed Service for Prometheus to collect and store Prometheus metrics.
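With Managed Service for Prometheus in managed-collection mode, for example, scrape targets are declared with a PodMonitoring resource instead of a Prometheus server configuration. A minimal illustrative manifest (the names, labels, and port are placeholders) might look like this:

```yaml
# Declares that GMP's managed collectors should scrape pods
# labeled app=example-app on their "metrics" port every 30s.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: metrics
    interval: 30s
```

The scraped samples land in Cloud Monitoring, so no Prometheus server or long-term storage has to be maintained.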

Health checks: What? When? How?

This article surveys various health checks in Google Cloud. If you want to learn more, leave your preferences in the feedback form.

Generally speaking, a health check is a function or method that indicates the general state (a.k.a. health) of the underlying service. Some products narrow the definition of “general state” to something particular, such as the ability of the service to respond to requests.

Health checks are an important instrument of service observability. When provided as a tool or service, they replace the development of custom metrics ingestion, collection, and analysis. Health checks often come integrated with alerting or incident response solutions, and in many scenarios they are sufficient to raise product reliability to the desired level.
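Conceptually, a health check rolls the state of a service's dependencies up into a single signal. A language-agnostic sketch of that aggregation (the function and check names are illustrative, not from the article):

```python
from typing import Callable


def health(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run named dependency probes and roll them up into an
    HTTP-style status: 200 if all pass, 503 otherwise."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = probe()
        except Exception:
            # A crashing probe counts as unhealthy, not as a server error.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

A load balancer or orchestrator polling this endpoint only needs the status code; the per-check breakdown helps humans during an incident.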

Etcd size monitoring in GKE

Google Cloud lets you run Kubernetes in three flavors:

  • Vanilla, where you do everything on your own. This is also the quickest “lift and shift” strategy for migrating your cluster to the cloud. Essentially it is just a group of virtual machines running on Google Compute Engine (GCE).
  • Managed, which shifts administration and maintenance tasks from DevOps teams to the cloud service. See Google Kubernetes Engine (GKE) Standard cluster architecture and Autopilot for more details.
  • Knative, sometimes referred to as cloud native, which hides the control plane and other infrastructure details behind a familiar workload-launching interface. Cloud Run offers service and job workloads on the GKE platform behind the Knative interface.

Many DevOps teams prefer the managed flavor because it balances carefree administration with a level of control very close to vanilla Kubernetes. Comparing GKE Autopilot and Standard, many prefer Standard for its finer-grained control over node management, security and version configuration, and other options. In the cluster observability domain the differences are less pronounced, since both come with a rich set of monitoring and logging capabilities, including control plane metrics.

Define Google Cloud Managed Service for Monitoring

You may have seen this notice when opening SLOs Overview in Cloud Console.

[Screenshot: the notice shown on the SLOs Overview page in the Cloud Console]

This notice announces a recent change in the way services are defined for Cloud Monitoring. Before the change, Cloud Monitoring automatically discovered services provisioned in App Engine, Cloud Run, or GKE and populated them in the Services Overview dashboard. After the change, all services in the Services Overview dashboard have to be created explicitly. To simplify this task, when defining a new service in the UI you are presented with a list of candidates built from the auto-discovered services. The full list of auto-discovered services includes managed services from App Engine, Cloud Run, and Istio, as well as GKE workloads and services. Besides the UI, you can add managed services to Cloud Monitoring using the services.create API or the Terraform google_monitoring_service resource.
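With the services.create API, the request body for a user-defined service can be as small as a display name plus an empty `custom` identifier. A hedged sketch (the builder function is hypothetical; check the Monitoring API v3 Service resource reference for current fields):

```python
def custom_service_body(display_name: str) -> dict:
    """Minimal request body for Cloud Monitoring's services.create:
    a user-defined ("custom") service with just a display name.
    POSTed to /v3/projects/{project}/services with a serviceId."""
    return {"displayName": display_name, "custom": {}}
```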

Google Cloud SLO demystified: Uncovering metrics behind predefined SLOs

Google Cloud supports service monitoring by defining and tracking SLOs for services based on metrics ingested into Google Cloud. This support greatly simplifies implementing SRE practices for services that are deployed to Google Cloud or that store telemetry data there. To make things even simpler for developers, service monitoring can automatically detect many types of managed services and offers predefined availability and latency SLI definitions for them.
When you define a new SLO, you are prompted to select a predefined SLI or to define your own.
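Behind the UI, a predefined SLI corresponds to the `basicSli` field of the ServiceLevelObjective resource in the Cloud Monitoring API v3. A sketch (the helper is hypothetical; the field names follow the v3 API as I understand it, so verify against the reference) of the request body for a rolling-window availability SLO:

```python
def availability_slo_body(goal: float, rolling_days: int = 30) -> dict:
    """Request body for services.serviceLevelObjectives.create using
    the predefined (basic) availability SLI and a rolling window."""
    return {
        "displayName": f"{goal:.1%} availability / {rolling_days}d",
        # Empty availability criteria means "use the service's
        # predefined notion of a successful request".
        "serviceLevelIndicator": {"basicSli": {"availability": {}}},
        "goal": goal,
        "rollingPeriod": f"{rolling_days * 86400}s",
    }
```

The upcoming posts in this series uncover which underlying metrics these predefined SLIs are computed from.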