Get notified when your GCP Compute Engine crashes due to HostError
In this post, we discuss the host-error failures that can hit your GCP Compute Engine instances when Google Cloud's underlying hardware or software infrastructure fails. Even though Stackdriver has uptime metrics for compute instances, there is no readily available metric in Stackdriver as of today to notify us of HostError failures. It is also important to set your instances' availability policy to mitigate these events, so that they get restarted automatically on a different host.
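As a rough sketch of the availability-policy step, the snippet below assembles a `gcloud compute instances set-scheduling` call that turns on automatic restart and live migration. The instance name `my-instance`, the zone `us-central1-a`, and the helper name are placeholders made up for illustration, not values from this post.

```python
import shlex


def set_scheduling_command(instance, zone):
    """Build the gcloud argument list that enables automatic restart for one instance."""
    return [
        "gcloud", "compute", "instances", "set-scheduling", instance,
        "--zone", zone,
        # Restart the VM if Compute Engine terminates it.
        "--restart-on-failure",
        # Live-migrate the VM during host maintenance instead of stopping it.
        "--maintenance-policy", "MIGRATE",
    ]


# Placeholder instance name and zone; substitute your own.
cmd = set_scheduling_command("my-instance", "us-central1-a")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` once you are happy with it.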
To get notified of these events, you need to configure a custom log-based metric.
Since these are very rare occurrences, you might not have a prior event to capture the logs from, so let’s see how this can be configured.
Go to Log-based-metrics under Logging.
Click on “CREATE METRIC”
(GCP recently upgraded its logs viewer.) If you are in the legacy logs viewer, click the drop-down on the right to convert it to an advanced filter, then enter the log filter parameters as below.
Replace the instance id with your compute engine’s instance id.
protoPayload.authenticationInfo.principalEmail = "firstname.lastname@example.org"
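For illustration, a filter along these lines could be assembled programmatically. This is a sketch under assumptions: the `resource.type` and `protoPayload.methodName` clauses are my assumption of what identifies a host-error entry in the system event audit log, the helper name is hypothetical, and the instance ID is a placeholder.

```python
def host_error_filter(instance_id):
    """Return an advanced log filter for host-error events on one instance."""
    clauses = [
        'resource.type="gce_instance"',
        f'resource.labels.instance_id="{instance_id}"',
        # Assumed method name for host-error system events.
        'protoPayload.methodName="compute.instances.hostError"',
    ]
    # Clauses on separate lines are implicitly ANDed by Cloud Logging.
    return "\n".join(clauses)


# Placeholder instance ID; replace it with your own.
print(host_error_filter("1234567890123456789"))
```

Keeping the instance ID as a parameter makes it easy to reuse the same filter for several instances.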
Check the filtered logs.
If you had an event, your log will look like the one below; I have put an “X” in place of the actual values.
message: "Instance terminated by Compute Engine."
Add any labels that need to be captured and create the metric.
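If you prefer the command line to the console, the same metric could be created with `gcloud logging metrics create`. The sketch below only builds the command; the metric name `host-error-count`, the description, and the filter string are assumptions chosen for the example, and this form does not attach any labels.

```python
import shlex


def create_metric_command(metric_name, log_filter, description):
    """Build the gcloud argument list that creates a counter-type log-based metric."""
    return [
        "gcloud", "logging", "metrics", "create", metric_name,
        f"--description={description}",
        f"--log-filter={log_filter}",
    ]


# Hypothetical metric name and an assumed filter; adjust both to your setup.
cmd = create_metric_command(
    "host-error-count",
    'resource.type="gce_instance" AND '
    'protoPayload.methodName="compute.instances.hostError"',
    "Counts Compute Engine host-error events",
)
print(shlex.join(cmd))
```

`shlex.join` quotes the filter so the printed command survives being pasted into a shell.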
Now create the alert from the metric using the dotted icon; it will take you to Stackdriver’s alerting policies.
Configure the alert with the custom log metric, and set the threshold so that a notification is sent through the selected notification channels whenever the metric goes above 0.
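The "above 0" threshold can also be written down as an alert policy definition. The following is a minimal sketch, assuming the Cloud Monitoring alert policy JSON format and a metric called `host-error-count`; the metric name, the notification channel path, and the 60-second alignment period are placeholders and assumptions, not settings from this post.

```python
import json


def host_error_alert_policy(metric_name, channels):
    """Build an alert policy body that fires when the metric rises above 0."""
    return {
        "displayName": f"{metric_name} above zero",
        "combiner": "OR",
        "conditions": [{
            "displayName": "HostError count > 0",
            "conditionThreshold": {
                # User-defined log-based metrics live under logging.googleapis.com/user/.
                "filter": (
                    f'metric.type="logging.googleapis.com/user/{metric_name}" '
                    'AND resource.type="gce_instance"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0,
                "duration": "0s",
                "aggregations": [{
                    "alignmentPeriod": "60s",  # assumed alignment window
                    "perSeriesAligner": "ALIGN_SUM",
                }],
            },
        }],
        "notificationChannels": channels,
    }


# Placeholder channel path; use the ID of one of your own channels.
policy = host_error_alert_policy(
    "host-error-count",
    ["projects/PROJECT_ID/notificationChannels/CHANNEL_ID"],
)
print(json.dumps(policy, indent=2))
```

The resulting JSON could be saved to a file and supplied to `gcloud alpha monitoring policies create --policy-from-file`, assuming that command is available in your SDK version.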
Use the gcloud beta logging write command to test the custom log metric and the alert.
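A test entry could be written roughly as follows. This sketch only builds the command; the log name `host-error-test` and the helper name are hypothetical, and the message simply reuses the text shown in the sample log above.

```python
import json
import shlex


def write_test_entry_command(log_name, message, severity="ERROR"):
    """Build the gcloud argument list that writes one JSON test entry to a log."""
    payload = json.dumps({"message": message})
    return [
        "gcloud", "beta", "logging", "write", log_name, payload,
        "--payload-type=json",
        f"--severity={severity}",
    ]


# Hypothetical log name; the message mirrors the sample log entry.
cmd = write_test_entry_command(
    "host-error-test", "Instance terminated by Compute Engine."
)
print(shlex.join(cmd))
```

An entry written this way carries a JSON payload, so it only increments the metric if the metric's filter matches fields that the test entry actually contains; you may need a temporarily relaxed filter while testing.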
gcloud beta logging write | Cloud SDK Documentation | Google Cloud
I hope you never need to encounter this issue, but if it happens, you know what to do.
Thanks for reading this post.