Etcd size monitoring in GKE

Google Cloud lets you run Kubernetes in three flavors:

  • Vanilla: you manage everything yourself. This is also the quickest “lift and shift” way to migrate a cluster to the cloud. Essentially, it is just a group of virtual machines running on Google Compute Engine (GCE).
  • Managed: administration and maintenance tasks shift from DevOps teams to the cloud service. See Google Kubernetes Engine (GKE) Standard cluster architecture and Autopilot for more details.
  • Knative, sometimes referred to as cloud native: the control plane and other infrastructure details are hidden behind the familiar interface of launching workloads. Cloud Run runs service and job workloads on the GKE platform behind the Knative interface.

Many DevOps teams prefer the managed flavor because it balances carefree administration with a level of control very close to vanilla Kubernetes. Between GKE Autopilot and Standard, many prefer Standard for its finer-grained control over node management, security, version configuration and other options. In cluster observability, these differences are less pronounced, since both modes come with a rich set of monitoring and logging capabilities, including control plane metrics.

Observability of the cluster’s control plane includes a curated set of API server, scheduler, controller manager and kubelet metrics. Monitoring these metrics is especially important when running large GKE clusters, or when workloads make heavy use of the Kubernetes API and the Kubernetes Resource Model (KRM), which puts a strain on the underlying managed layer and risks hitting the GKE limits.

There is one more metric that can be collected, although you will not find it in the documentation about Kubernetes or Knative managed metrics: the Etcd database size. The Etcd database size depends on the number of objects created through the Kubernetes API, including custom resources. If the size of the database exceeds the GKE limit of 6GB, the cluster’s control plane will become unresponsive. This metric can be monitored in the Cloud Console. You can also use the gcloud CLI command:

gcloud alpha quotas list --service='container.googleapis.com'

or a REST API call to the Cloud Quotas API (do not forget to enable the cloudquotas.googleapis.com API):

curl \
  'https://cloudquotas.googleapis.com/v1/projects/[YOUR_PROJECT]/locations/global/services/container.googleapis.com/quotaInfos/EtcdDatabaseSizeBytes' \
  --header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
  --header 'Accept: application/json' \
  --compressed
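The same REST call can be made from a script. The following is a minimal Python sketch using only the standard library, not an official client. The helper names (quota_info_url, fetch_quota_info, quota_values) are hypothetical, and the response fields read here (dimensionsInfos, details.value) are an assumption about the shape of the QuotaInfo resource that should be checked against the Cloud Quotas API reference.

```python
import json
import urllib.request

API_ROOT = "https://cloudquotas.googleapis.com/v1"


def quota_info_url(project, service="container.googleapis.com",
                   quota_id="EtcdDatabaseSizeBytes"):
    """Build the quotaInfos resource URL used in the curl call above."""
    return (f"{API_ROOT}/projects/{project}/locations/global"
            f"/services/{service}/quotaInfos/{quota_id}")


def fetch_quota_info(project, access_token):
    """Send the GET request and return the decoded JSON body."""
    request = urllib.request.Request(
        quota_info_url(project),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def quota_values(quota_info):
    """Collect numeric values found under dimensionsInfos (assumed field)."""
    values = []
    for info in quota_info.get("dimensionsInfos", []):
        value = info.get("details", {}).get("value")
        if value is not None:
            # 64-bit integers are typically serialized as JSON strings.
            values.append(int(value))
    return values
```

An access token for fetch_quota_info can be obtained, for example, with `gcloud auth print-access-token`. The parsing helper is kept separate from the network call so it can be exercised on a saved response.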

Additionally, you can read this metric in your code with one of the client libraries.

If your cluster is at risk of reaching the Etcd database size limit, consider setting up an alert on your quota usage. You can also automate the alert response using a Pub/Sub notification channel and Cloud Build or another execution pipeline. See this article for an example of alert automation.
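The threshold logic behind such an alert can be sketched in a few lines of Python. This is an illustration only: it assumes the 6GB limit means 6 GiB, and the 80% threshold and the function names are hypothetical choices, not values prescribed by GKE.

```python
# Assumption: the 6GB limit is interpreted as 6 GiB.
ETCD_LIMIT_BYTES = 6 * 1024**3


def etcd_usage_ratio(size_bytes, limit_bytes=ETCD_LIMIT_BYTES):
    """Return the fraction of the limit consumed by the database."""
    if size_bytes < 0 or limit_bytes <= 0:
        raise ValueError("size must be >= 0 and limit must be > 0")
    return size_bytes / limit_bytes


def should_alert(size_bytes, threshold=0.8, limit_bytes=ETCD_LIMIT_BYTES):
    """Report whether usage has reached the (hypothetical) alert threshold."""
    return etcd_usage_ratio(size_bytes, limit_bytes) >= threshold
```

For example, should_alert(5 * 1024**3) evaluates to True, because 5 GiB is about 83% of the assumed limit, while 4 GiB (about 67%) stays below the threshold. A lower threshold leaves more time to clean up unused resources before the control plane is affected.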