Skip to main content
logoTetrate Service BridgeVersion: next

Alerting Guidelines

note

Tetrate Service Bridge collects a large number of metrics and the relationship between those, and the threshold limits that you set will differ from environment to environment. This document outlines the generic alerting guidelines rather than providing an exhaustive list of alert configurations and thresholds, since these will differ between different environments with different workload configurations.

The alerting configuration using Prometheus Operator CR can be downloaded from here.

For more information on how to configure PrometheusRules, refer to official Prometheus documentation.

TSB Health Status

Dashboard indicating key components health for fast diagnosis.

Green or red colored panels for each TSB component, which is automatically diagnosed based on current metrics contrasted against its historic and weighted by severity.

You can find it in the Grafana Dashboards download button here.

The dashboard is divided into core (top) and non-core (bottom) components.

Tetrate Service Bridge health status dashboard

A red light indicates it is urgent to check the given component's health, whether it is down, unable to report metrics, or the reported metrics indicate an uncommon behavior. Take this example with edge. As it is distributed in each control plane, you must dig deeper into which one and why it is being reported:

Tetrate Service Bridge health status dashboard alert

After identifying a failing component it is recommended to also check the more detailed alerts described later in this section: TSB and XCP Operational Status.

Diagnosis is limited

This tool is specially useful once your clusters are set up and operational. Early stages clusters, e.g. with incomplete onboardings, may fail to provide the necessary data for this diagnosis to be accurate.

Components dependencies

Since each component is tightly dependant upon the operation of the others, in some situations the failure of one will also trigger a red light on a different one. This is because of the nature of the metrics selected to diagnose the components. It is specially the case of envoy in the management plane: if it fails, edge, tsb and iam will have false positives. Be aware of the dependencies so you can debug the right direction.

MPCXCPXCPedAPIIAMOAPEnvoy
MPC-
XCP-
XCPed-
API-
IAM-
OAP-
Envoydepdepdep-

TSB Operational Status

Alerts describing the health of TSB Management Plane components.

TSB Absent Metrics

TSB talks to its persistent backend even without constant external load. An absence of requests reliably indicates an issue with TSB metrics collection, and should be treated as a high-priority incident as the lack of metrics means the loss of visibility into TSB's status.

Check the health of the metrics ingestion infrastructure - from TSB's otel-collector to your Prometheus installation.

TSB Request Traffic

The raw rate of requests to TSB API. The monitoring value comes principally from detecting outliers and unexpected behaviour, e.g. the rate of requests dropping to 0.

Things to check:

  • health of the cluster ingress.
  • health of TSB ingress infrastructure (e.g. AWS ALB etc).
  • health of the metrics ingestion and processing infrastructure, i.e. Prometheus.
  • possible DNS misconfiguration.
  • inspect the TSB Front Envoy logs for any additional clues (e.g. TSL termination errors, TSB cluster network partitioning signs etc).

TSB API Availability

The rate of successful requests to TSB API. This is an extremely user-visible signal and should be treated as such.

Drill down to the "TSB Error Rate" panel in the "TSB Operational Status" dashboard to find which gRPC method is producing errors. To troubleshoot further, consult the deployments/tsb logs.

For example, if the leading source of errors is e.g. GetTenant, searching the tsb component logs for the lines containing this method may provide additional insight into the nature of the errors.

Front Envoy Availability

The rate of successful requests to TSB Management Plane Front Envoy. TSB Management Plane Envoy Gateway (front-envoy) terminates all incoming requests to TSB Management Plane components. This alert is MUST be inhibited if the "TSB API Availability" alert is firing. Refer to Alertmanager docs on how to configure alert inhibition. The monitoring value of this alert is "catch high error rates of non-API TSB components, e.g. Control Plane sync, metrics propagation etc".

Drill down to the "Front Envoy Error Rate" panel in the "TSB Operational Status" dashboard to to narrow down which Envoy cluster is experiencing issues. Envoy cluster names correspond to the TSB Management Plane components, as described in the TSB architecture document.

TSB Front Envoy p99 Latency

The highest (99th percentile) latency of requests served by TSB Front Envoy. These include both the API requests (via UI or CLI) and XCP configuration propagation-related requests. High latency may be an indicator of one of TSB components overload/resource saturation and may impact both API experience (slow UI/CLI) and config propagation.

Drill down to the "Front Envoy Latency" panel in the "TSB Operational Status" dashboard to narrow down which Envoy cluster is experiencing issues. Envoy cluster names correspond to the TSB Management Plane components, as described in the TSB architecture document.

Further inspecting the relevant TSB Component logs may provide additional insight.

Also, it is always a good idea to inspect the "Data Store Latency" panel and search for any correlations between the increased TSB latency and Postgres operations latency.

Persistent Backend Availability

Persistent backend availability from TSB with no insight into the internal Postgres operations.

TSB stores all of its state in the persistent backend and as such, its operational status (availability, latency, throughput etc.) is tightly coupled with the status of the persistent backend. TSB records the metrics for persistent backend operations that may be used as a signal to alert on.

It is important to note that any degradation in persistent backend operations will inevitably lead to overall TSB degradation, be it availability, latency or throughput. This means that alerting on persistent backend status may be redundant and the oncall person will receive two pages instead of one whenever there is a problem with Postgres that requires attention. However, such a signal still has significant value in providing important context to decrease the time to triage the issue and address the root cause/escalate.

Data store panels provide a helpful set of metrics that should be consulted first when troubleshooting this alert.

note

Treatment of "resource not found" errors: small number of "not found" responses are normal because TSB, for the purposes of optimisation, often uses Get queries instead of Exists in order to determine the resource existence. However, a large rate of "not found" (404-like) responses likely indicates an issue with the persistent backend setup.

Consult the tsb and iam component logs to narrow down the nature of Postgres errors. To troubleshoot further, verify the health of the Postgres installation.

MPC Component Health

MPC (Management Plane Controller) is a TSB MP component that is responsible for configuration and state validation and bi-directional sync between TSB API and XCP Central component. MPC operational issues indicate that configuration updates may not be applied or applied too slow. MPC Operational Status dashboard provides an extensive list of signals that can be helpful in troubleshooting MPC issues. To troubleshoot further, consult the mpc component logs. Note: MPC uses k8s API as an intermittent configuration storage. Any issues affecting k8s APIServer (namely, its etcd installation) will result in degraded MPC health. MPC logs can be helpful in confirming the health and interaction between MPC and k8s.

XCP Operational Status

Alerts describing the health of TSB Management Plane components.

Last Management Plane Sync

The max time elapsed since XCP Edge last synced with the management plane (XCP central) for each registered cluster. This indicates how stale the configuration received from the management plane is in a given cluster. A reasonable first iteration threshold here is 300 (seconds).

Use "XCP Central Operational Status" and "XCP Edge status" dashboards to narrow down which Control Plane cluster is experiencing syncing issues. Verify the connectivity between the CP cluster and MP. To troubleshoot further, xcp-central MP and edge CP component logs can provide additional insight.