Get notified when your GCP Compute Engine crashes due to HostError

Photo by Taylor Vick on Unsplash

In this post, we discuss the host-error failures that can hit your GCP Compute Engine instances when Google Cloud's underlying hardware or software infrastructure fails. Stackdriver provides uptime metrics for compute instances, but as of this writing it has no ready-made metric that notifies you of HostError failures. It is also important to set the availability policy on your instances so that they are restarted automatically on a different host after such an event.

Availability policy for compute engines
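As a rough sketch, the same availability policy can be written as the request body accepted by the Compute Engine API's instances.setScheduling method. The helper name below is hypothetical, and the chosen values (automatic restart enabled, live migration on host maintenance) are an assumption about what you want rather than settings taken from this post.

```python
import json


def build_scheduling_body(automatic_restart=True, on_host_maintenance="MIGRATE"):
    """Return a body for instances.setScheduling (hypothetical helper).

    automaticRestart and onHostMaintenance are fields of the Compute Engine
    Scheduling resource; the default values here are an assumption.
    """
    if on_host_maintenance not in ("MIGRATE", "TERMINATE"):
        raise ValueError("on_host_maintenance must be MIGRATE or TERMINATE")
    return {
        "automaticRestart": automatic_restart,
        "onHostMaintenance": on_host_maintenance,
    }


# Print the body so it can be compared with what the console shows.
print(json.dumps(build_scheduling_body(), indent=2))
```

The body only describes the policy; sending it to the API (or setting the same options in the console) is left to whichever client you already use.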

To get notified of these events, you need to configure a custom log-based metric.

Since these events are very rare, you may not have a prior occurrence whose logs you can inspect, so let's see how to configure this.

Step 1:

Go to Log-based-metrics under Logging.

Log-based-metrics under Logging

Step 2:

Click on “CREATE METRIC”

Step 3:

(GCP recently upgraded its logs viewer.) If you are in the legacy logs viewer, click the drop-down on the right to convert to an advanced filter, then enter the log filter parameters shown below.

Replace the instance id with your compute engine’s instance id.

resource.type="gce_instance"
resource.labels.instance_id="XXXXXXXXXXXXXXXXX"
protoPayload.methodName="compute.instances.hostError"
protoPayload.authenticationInfo.principalEmail = "system@google.com"
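If you need this filter for more than one instance, a small helper can generate it. This is only a sketch: the function name is hypothetical, and it relies on the Logging query language treating each line as an implicit AND, which is how the filter above is written.

```python
def build_host_error_filter(instance_id):
    """Build the hostError log filter for one instance (hypothetical helper)."""
    clauses = [
        'resource.type="gce_instance"',
        f'resource.labels.instance_id="{instance_id}"',
        'protoPayload.methodName="compute.instances.hostError"',
        'protoPayload.authenticationInfo.principalEmail="system@google.com"',
    ]
    # One clause per line, matching the layout of the filter in this post.
    return "\n".join(clauses)


# The X placeholder stands in for a real instance id, as in the post.
print(build_host_error_filter("XXXXXXXXXXXXXXXXX"))
```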

Step 4:

Check the filtered logs.

If you have had an event, the log entry will look like the one below. I have replaced the actual values with “X”.

{
  insertId: "XXXXXXXXXXXX"
  logName: "projects/XXX-ProjectName-XXX/logs/cloudaudit.googleapis.com%2Fsystem_event"
  operation: {
    first: true
    id: "systemevent-XXXXXXXXXXXX-XXXXXXXXXXXX--XXXXXXXXXXXX--XXXXXXXXXXXX-"
    last: true
    producer: "compute.instances.hostError"
  }
  protoPayload: {
    @type: "type.googleapis.com/google.cloud.audit.AuditLog"
    authenticationInfo: {
      principalEmail: "system@google.com"
    }
    methodName: "compute.instances.hostError"
    request: {
      @type: "type.googleapis.com/compute.instances.hostError"
    }
    resourceName: "projects/XXX-ProjectName-XXX/zones/XXX-ZoneName-XXX/instances/XXX-InstanceName-XXX"
    serviceName: "compute.googleapis.com"
    status: {
      message: "Instance terminated by Compute Engine."
    }
  }
  receiveTimestamp: "XXXX-XX-XXTXX:XX:XX.XXXXXXXZ"
  resource: {
    labels: {
      instance_id: "XXX-InstanceID-XXX"
      project_id: "XXX-ProjectName-XXX"
      zone: "XXX-ZoneName-XXX"
    }
    type: "gce_instance"
  }
  severity: "INFO"
  timestamp: "XXXX-XX-XXTXX:XX:XX.XXXXXZ"
}
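To make it clearer which fields of this entry the filter depends on, here is a minimal sketch that checks an entry (as a Python dict) against the same three conditions. The function name and the trimmed sample entry are illustrative only; the real matching is done by Cloud Logging, not by this code.

```python
def is_host_error(entry):
    """Return True if a log entry dict looks like a hostError system event."""
    payload = entry.get("protoPayload", {})
    resource = entry.get("resource", {})
    principal = payload.get("authenticationInfo", {}).get("principalEmail")
    return (
        resource.get("type") == "gce_instance"
        and payload.get("methodName") == "compute.instances.hostError"
        and principal == "system@google.com"
    )


# Trimmed copy of the entry above, keeping only the fields the filter reads.
SAMPLE_ENTRY = {
    "protoPayload": {
        "authenticationInfo": {"principalEmail": "system@google.com"},
        "methodName": "compute.instances.hostError",
    },
    "resource": {
        "labels": {"instance_id": "XXX-InstanceID-XXX"},
        "type": "gce_instance",
    },
    "severity": "INFO",
}

print(is_host_error(SAMPLE_ENTRY))
```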

Step 5:

Add any labels you need to capture, then create the metric.
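For reference, the metric you create in the console corresponds roughly to a LogMetric resource in the Cloud Logging API. The sketch below only builds that body as a dict. The helper name, the metric name, and the instance_resource label are hypothetical choices for illustration, while the field names (filter, metricDescriptor, labelExtractors) come from the LogMetric resource.

```python
import json


def build_log_metric(name, log_filter):
    """Build a LogMetric body for a counter metric (hypothetical helper)."""
    return {
        "name": name,
        "description": "Counts Compute Engine hostError system events",
        "filter": log_filter,
        "metricDescriptor": {
            # A counter metric: each matching log entry adds one.
            "metricKind": "DELTA",
            "valueType": "INT64",
            "labels": [
                {
                    "key": "instance_resource",  # hypothetical label name
                    "valueType": "STRING",
                    "description": "Full resource name of the instance",
                }
            ],
        },
        # Copy the instance's resource name from the entry into the label.
        "labelExtractors": {
            "instance_resource": "EXTRACT(protoPayload.resourceName)"
        },
    }


EXAMPLE_FILTER = "\n".join([
    'resource.type="gce_instance"',
    'resource.labels.instance_id="XXXXXXXXXXXXXXXXX"',
    'protoPayload.methodName="compute.instances.hostError"',
    'protoPayload.authenticationInfo.principalEmail="system@google.com"',
])

print(json.dumps(build_log_metric("gce-host-error", EXAMPLE_FILTER), indent=2))
```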

Step 6:

Now create an alert from the metric using the dotted icon; it will take you to Stackdriver's alerting policies.

Create Alert from Metric

Step 7:

Configure the alert with the custom log metric and set the threshold so that any value above 0 sends a notification through the selected notification channels.
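The threshold described here maps roughly onto an AlertPolicy resource in the Cloud Monitoring API. The following is a sketch of that body under a few assumptions: the helper name, the metric name, and the channel path are placeholders, and the 60-second alignment period is my choice, not a value from this post.

```python
import json


def build_alert_policy(metric_name, notification_channels):
    """Build an AlertPolicy body that fires when the count goes above 0."""
    metric_filter = (
        f'metric.type="logging.googleapis.com/user/{metric_name}" '
        'AND resource.type="gce_instance"'
    )
    return {
        "displayName": f"HostError on Compute Engine ({metric_name})",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "hostError count above 0",
                "conditionThreshold": {
                    "filter": metric_filter,
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 0,
                    # Fire as soon as one aligned point exceeds the threshold.
                    "duration": "0s",
                    "aggregations": [
                        {
                            "alignmentPeriod": "60s",  # assumed value
                            "perSeriesAligner": "ALIGN_SUM",
                        }
                    ],
                },
            }
        ],
        "notificationChannels": list(notification_channels),
    }


policy = build_alert_policy(
    "gce-host-error",  # hypothetical metric name
    ["projects/XXX-ProjectName-XXX/notificationChannels/XXXXXXXX"],
)
print(json.dumps(policy, indent=2))
```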

Step 8:

Use the gcloud beta logging write command to test the custom log metric and the alert.

I hope you never encounter this issue, but if you do, you know what to do.
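A minimal sketch of how such a test command could be assembled is shown below. The log name and payload fields are hypothetical. Also note that an entry written this way is stored as a jsonPayload under the log name you choose, so it will likely not match a filter on protoPayload fields; testing with a temporary metric whose filter matches the synthetic entry is an assumption on my part, not something this post prescribes.

```python
import json
import shlex
from typing import List


def build_test_write_command(log_name: str, instance_id: str) -> List[str]:
    """Assemble a gcloud beta logging write command (hypothetical helper)."""
    payload = {
        "methodName": "compute.instances.hostError",
        "instance_id": instance_id,
        "note": "synthetic test entry",
    }
    return [
        "gcloud", "beta", "logging", "write",
        log_name,
        json.dumps(payload),
        "--payload-type=json",
        "--severity=INFO",
    ]


cmd = build_test_write_command("hosterror-test", "XXXXXXXXXXXXXXXXX")
# Print a shell-safe version to copy into a terminal; nothing is executed here.
print(shlex.join(cmd))
```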

Thanks for reading this post.

Rajathithan Rajasekar


I like to write code in Python. Interested in cloud, data analysis, computer vision, ML and deep learning. https://www.linkedin.com/in/rajathithan-rajasekar/