Version: 0.9.x

TSB Alerting Guidelines

note

Tetrate Service Bridge collects a large number of metrics and the relations between those differ from environment to environment. This document only outlines generic alerting guidelines rather than providing an exhaustive list of alert configurations and thresholds, since these will differ between different environments with different workload patterns.

Overall, the alert configuration should follow several principles:

  • Every alert must be urgent and actionable. Alerts that do not require an immediate response should be notifications or tasks/tickets instead.

  • The number of alerts should be kept to a minimum to avoid alert fatigue for your on-call engineers.

  • Avoid redundant alerts.

  • Alert on symptoms and not the cause, when applicable.

  • Every alert must have an up-to-date playbook/runbook that serves as a source of truth for impact, troubleshooting scenarios and documentation.

TSB Operational Status

TSB Availability

The rate of successful requests to the TSB API. This is an extremely user-visible signal and should be treated as such.

The THRESHOLD value should be established from historical metrics data used as a baseline. A sensible value for a first iteration would be 0.99. Example PromQL expression:

sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) BY (grpc_method) / sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) BY (grpc_method) < THRESHOLD
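
Since per-method ratios can be noisy for rarely called methods, a simpler first iteration is to alert on the overall ratio with the 0.99 threshold substituted. This is a sketch; the per-method breakdown above remains useful for triage:

sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) / sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) < 0.99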

TSB Request Latency

TSB gRPC API request latency metrics are intentionally not emitted due to high metric cardinality. Tetrate is gathering feedback on whether gRPC API latency metrics should be added back in future releases.

TSB Request Traffic

The raw rate of requests to the TSB API. The monitoring value comes mostly from detecting outliers and unexpected behaviour, e.g. an unexpectedly high or low request rate. To establish correct thresholds, it is important to have historical metrics data to gauge the baseline. Example PromQL expression:

sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) BY (grpc_method) < (or >) THRESHOLD
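
If no obvious static THRESHOLD emerges from the baseline, one possible sketch is to compare the current rate against the same time one week earlier using offset. The one-week offset and the factor of 2 below are illustrative assumptions, not recommended values:

sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) > 2 * sum(rate(grpc_server_handled_total{component="tsb", grpc_type="unary", grpc_method!="SendAuditLog"}[1m] offset 1w))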

TSB Absent Metrics

TSB talks to its persistent backend even without constant external load. Absence of these requests in the metrics reliably indicates an issue with TSB metrics collection and should be treated as a high-priority incident, as the lack of metrics means a loss of visibility into TSB status.

note

One of the common causes of this issue is a deadlock in the OpenTelemetry collector. If this alert fires, one of the first steps should be to check the otel-collector pod status and restart it if needed. Tetrate is currently working with upstream maintainers to address this bug.

Example PromQL expression:

sum(rate(persistence_operation[10m])) == 0
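
Note that if the persistence_operation series disappears from the metrics backend entirely (for example, when the scrape itself stops), the expression above returns no samples at all and will not fire. A complementary absent() expression covers that case:

absent(persistence_operation)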

Persistent Backend Availability

Persistent backend availability as observed from TSB, without insight into internal Postgres operations.

TSB stores all of its state in the persistent backend and, as such, its operational status (availability, latency, throughput, etc.) is tightly coupled with the status of the persistent backend. TSB records metrics for persistent backend operations that may be used as a signal to alert on.

It is important to note that any degradation in persistent backend operations will inevitably lead to overall TSB degradation, be it availability, latency or throughput. This means that alerting on persistent backend status may be redundant: the on-call engineer will receive two pages instead of one whenever a Postgres problem requires attention. However, such a signal still has significant value in providing context that decreases the time to triage the issue and address the root cause or escalate. In this case, alerting on the cause rather than the symptom is a trade-off between having technically redundant alerts and reducing the time to triage the issue.

Note on the treatment of "resource not found" errors: some level of "not found" responses is normal because TSB, as an optimisation, often uses Get queries instead of Exists queries to determine whether a resource exists. However, a large rate of "not found" (404-like) responses likely indicates an issue with the persistent backend setup.

Example PromQL expressions:

  • Queries:
1 - ( sum(rate(persistence_operation{error!="", error!="resource not found"}[1m])) / sum(rate(persistence_operation[1m])) OR on() vector(0) ) < THRESHOLD
  • Too many "resource not found" queries:
( sum(rate(persistence_operation{error="resource not found"}[1m])) OR on() vector(0) ) / sum(rate(persistence_operation[1m])) > THRESHOLD (e.g. 0.50)
  • Transactions:
sum(rate(persistence_transaction{error=""}[1m])) / sum(rate(persistence_transaction[1m])) < THRESHOLD

Persistent Backend Latency

The latency of persistent backend operations as recorded by the persistent backend client (TSB). This latency effectively translates to user-seen latency and as such is a vital signal.

The THRESHOLD value should be established from historical metrics data used as a baseline. A sensible value for a first iteration would be 300ms 99th percentile latency.

Example PromQL expressions:

  • Queries
histogram_quantile(0.99, sum(rate(persistence_operation_duration_bucket[1m])) by (le, method)) > THRESHOLD
  • Transactions:
histogram_quantile(0.99, sum(rate(persistence_transaction_duration_bucket[1m])) by (le)) > THRESHOLD
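
Assuming these duration histograms are recorded in seconds (the usual Prometheus convention), the 300ms 99th percentile threshold above translates to 0.3, e.g. for queries:

histogram_quantile(0.99, sum(rate(persistence_operation_duration_bucket[1m])) by (le, method)) > 0.3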

TSBD Operational Status

Last Management Plane Sync

The maximum time elapsed since tsbd last synced with the management plane, for each registered cluster. This indicates how stale the configuration received from the management plane is in a given cluster. A reasonable first-iteration threshold here is 30 seconds.

Example PromQL expression:

time() - max(tsbd_tsb_latest_sync_time{cluster_name="$cluster"}) > THRESHOLD
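
The $cluster placeholder above is a dashboard-style variable. For a single alerting rule covering every registered cluster, the same check can be aggregated per cluster; the sketch below assumes the 30 second threshold suggested above:

time() - max(tsbd_tsb_latest_sync_time) by (cluster_name) > 30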

TSBD Saturation

TSB Control Plane components are mostly CPU-constrained. Thus, CPU utilisation serves as an important signal and should be alerted on. Keep in mind when choosing alert THRESHOLDs that not only do cloud providers tend to overprovision CPU, but hyperthreading can also reduce Linux scheduler efficiency and lead to increased latencies/errors even below ~80% CPU utilisation.

Istio Operational Status

NB: this is not an exhaustive list of valuable signals that the Istio service mesh provides. For more in-depth information please refer to:

  • https://istio.io/latest/docs/examples/microservices-istio/logs-istio/
  • https://istio.io/latest/docs/ops/best-practices/observability/
  • https://istio.io/latest/docs/concepts/observability/

This document describes the absolute bare minimum alerting setup for Istio service mesh.

Proxy Convergence Time

Delay in seconds between a config change and a proxy receiving all required configuration. This is another component of the end-to-end configuration propagation latency.

Example PromQL expression:

histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le)) > THRESHOLD

Istiod Error Rate

The error rate of various Istiod operations. To establish correct thresholds, it is important to have historical metrics data to gauge the baseline.

Example PromQL queries:

  • Write Timeouts:
sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) > THRESHOLD
  • Internal Errors:
sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) > THRESHOLD
  • Config Rejections:
sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) > THRESHOLD

Configuration Validation

The success rate of Istio configuration validation requests. An elevated error rate indicates that the Istio configuration generated and propagated by tsbd is not valid; this should be addressed urgently.

Example PromQL expression:

sum(rate(galley_validation_passed{cluster_name="$cluster"}[1m])) / ( sum(rate(galley_validation_passed{cluster_name="$cluster"}[1m])) + sum(rate(galley_validation_failed{cluster_name="$cluster"}[1m])) ) < THRESHOLD

Capacity Planning and Resource Saturation

TSB, tsbd, OAP/Zipkin Saturation

TSB components are mostly CPU-constrained; in addition, OAP/Zipkin memory utilisation depends on the amount of telemetry and traces they collect. Thus, CPU utilisation serves as an important signal and should be alerted on. Even though it is not a direct symptom of an issue affecting users, saturation provides a valuable signal that the system is underprovisioned/oversaturated before it has a negative user impact.

Keep in mind when choosing alert THRESHOLDs that not only do cloud providers tend to overprovision CPU, but hyperthreading can also reduce Linux scheduler efficiency and lead to increased latencies/errors even below ~80% CPU utilisation.
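
As a sketch of a saturation signal on Kubernetes, CPU and memory usage can be compared against the configured container limits using cAdvisor and kube-state-metrics metrics. The namespace selector, the resource label values and the 0.8 threshold below are assumptions that depend on the installation and the metrics stack version:

sum(rate(container_cpu_usage_seconds_total{namespace="tsb", container!=""}[5m])) by (pod) / sum(kube_pod_container_resource_limits{namespace="tsb", resource="cpu"}) by (pod) > 0.8

sum(container_memory_working_set_bytes{namespace="tsb", container!=""}) by (pod) / sum(kube_pod_container_resource_limits{namespace="tsb", resource="memory"}) by (pod) > 0.8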