Alerting Guidelines
Tetrate Service Bridge collects a large number of metrics, and the relationships between them, as well as the threshold limits you set, will differ from environment to environment. This document therefore outlines generic alerting guidelines rather than providing an exhaustive list of alert configurations and thresholds, since these will differ between environments with different workload configurations.
The alerting configuration using the Prometheus Operator CR can be downloaded from here. For more information on how to configure PrometheusRules, refer to the official Prometheus documentation.
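As a reference for the shape such a configuration takes, a minimal PrometheusRule skeleton might look like the following. The name, namespace, and labels are illustrative and must be adapted to match your Prometheus Operator setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  # Illustrative name/namespace; adjust to your installation.
  name: tsb-alerting-rules
  namespace: monitoring
  labels:
    # Must match the ruleSelector of your Prometheus CR.
    release: prometheus
spec:
  groups:
    - name: tsb.rules
      rules:
        - alert: ExamplePlaceholderAlert
          expr: vector(1)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Placeholder; replace with TSB-specific rules"
```

The alert rules sketched in the sections below would each go under `spec.groups[].rules`.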
TSB Operational Status
Alerts describing the health of TSB Management Plane components.
TSB Absent Metrics
TSB talks to its persistent backend even without constant external load, so its request metrics should never disappear entirely. An absence of these metrics reliably indicates an issue with TSB metrics collection and should be treated as a high-priority incident, as the lack of metrics means the loss of visibility into TSB's status.
Check the health of the metrics ingestion infrastructure, from TSB's otel-collector to your Prometheus installation.
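One way to express this as a rule is with PromQL's `absent()` function. The metric selector below is an assumption for illustration; substitute the actual TSB request metric scraped in your environment:

```yaml
- alert: TSBMetricsAbsent
  # `grpc_server_handled_total{component="tsb"}` is a hypothetical selector;
  # use whatever TSB request metric your Prometheus actually scrapes.
  expr: absent(grpc_server_handled_total{component="tsb"})
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "TSB request metrics are absent - metrics collection is likely broken"
```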
TSB Request Traffic
The raw rate of requests to the TSB API. The monitoring value comes principally from detecting outliers and unexpected behaviour, e.g. the rate of requests dropping to 0.
Things to check:
- health of the cluster ingress.
- health of TSB ingress infrastructure (e.g. AWS ALB etc).
- health of the metrics ingestion and processing infrastructure, i.e. Prometheus.
- possible DNS misconfiguration.
- inspect the TSB Front Envoy logs for additional clues (e.g. TLS termination errors, signs of TSB cluster network partitioning, etc.).
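A "traffic dropped to zero" condition can be sketched as a rate-based rule. As above, the metric selector is a hypothetical placeholder:

```yaml
- alert: TSBRequestTrafficZero
  # Hypothetical selector; replace with your actual TSB API request metric.
  expr: sum(rate(grpc_server_handled_total{component="tsb"}[5m])) == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "TSB API request rate has dropped to zero"
```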
TSB API Availability
The rate of successful requests to the TSB API. This is an extremely user-visible signal and should be treated as such.
Drill down to the "TSB Error Rate" panel in the "TSB Operational Status" dashboard to find which gRPC method is producing errors. To troubleshoot further, consult the deployments/tsb logs. For example, if the leading source of errors is GetTenant, searching the tsb component logs for lines containing this method may provide additional insight into the nature of the errors.
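An availability alert of this kind is commonly written as an error ratio over a time window. The metric and labels below follow the standard gRPC server instrumentation convention (response status in the `grpc_code` label) but are assumptions to verify against your environment, as is the 5% threshold:

```yaml
- alert: TSBAPIHighErrorRate
  # Hypothetical metric/labels; the conventional gRPC server metric records
  # the response status in the `grpc_code` label.
  expr: |
    sum(rate(grpc_server_handled_total{component="tsb",grpc_code!="OK"}[5m]))
      /
    sum(rate(grpc_server_handled_total{component="tsb"}[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "More than 5% of TSB API requests are failing"
```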
Front Envoy Availability
The rate of successful requests to the TSB Management Plane Front Envoy. The TSB Management Plane Envoy Gateway (front-envoy) terminates all incoming requests to the TSB Management Plane components. This alert MUST be inhibited if the "TSB API Availability" alert is firing. Refer to the Alertmanager docs on how to configure alert inhibition.
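Inhibition is configured in Alertmanager itself, not in the PrometheusRule. Assuming the two alerts are named `TSBAPIAvailability` and `FrontEnvoyAvailability` (names are illustrative), a sketch of the inhibition rule could look like:

```yaml
# Alertmanager configuration fragment (alertmanager.yml).
inhibit_rules:
  - source_matchers:
      - alertname = "TSBAPIAvailability"
    target_matchers:
      - alertname = "FrontEnvoyAvailability"
    # Only inhibit when both alerts carry the same cluster label.
    equal: ['cluster']
```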
The monitoring value of this alert lies in catching high error rates of non-API TSB components, e.g. Control Plane sync, metrics propagation, etc.
Drill down to the "Front Envoy Error Rate" panel in the "TSB Operational Status" dashboard to narrow down which Envoy cluster is experiencing issues. Envoy cluster names correspond to the TSB Management Plane components, as described in the TSB architecture document.
TSB Front Envoy p99 Latency
The highest (99th percentile) latency of requests served by the TSB Front Envoy. These include both API requests (via UI or CLI) and XCP configuration propagation-related requests. High latency may indicate overload or resource saturation of one of the TSB components, and may impact both the API experience (slow UI/CLI) and configuration propagation.
Drill down to the "Front Envoy Latency" panel in the "TSB Operational Status" dashboard to narrow down which Envoy cluster is experiencing issues. Envoy cluster names correspond to the TSB Management Plane components, as described in the TSB architecture document.
Further inspecting the relevant TSB Component logs may provide additional insight.
Also, it is always a good idea to inspect the "Data Store Latency" panel and search for any correlations between the increased TSB latency and Postgres operations latency.
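A p99 latency rule is typically built with `histogram_quantile()` over a latency histogram. Envoy exposes upstream request time (in milliseconds) as a histogram; the exact metric name and the 1000 ms threshold below are assumptions to adapt to your environment:

```yaml
- alert: TSBFrontEnvoyHighP99Latency
  # `envoy_cluster_upstream_rq_time_bucket` is Envoy's upstream request-time
  # histogram as scraped by Prometheus; verify the name in your environment.
  expr: |
    histogram_quantile(0.99,
      sum by (le, envoy_cluster_name) (
        rate(envoy_cluster_upstream_rq_time_bucket[5m])
      )
    ) > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Front Envoy p99 latency above 1s for {{ $labels.envoy_cluster_name }}"
```

Aggregating by `envoy_cluster_name` lets the alert point directly at the affected Management Plane component.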
Persistent Backend Availability
Persistent backend availability as observed from TSB, with no insight into internal Postgres operations.
TSB stores all of its state in the persistent backend and as such, its operational status (availability, latency, throughput etc.) is tightly coupled with the status of the persistent backend. TSB records the metrics for persistent backend operations that may be used as a signal to alert on.
It is important to note that any degradation in persistent backend operations will inevitably lead to overall TSB degradation, be it in availability, latency or throughput. This means that alerting on persistent backend status may be redundant: the on-call engineer will receive two pages instead of one whenever there is a problem with Postgres that requires attention. However, such a signal still has significant value in providing important context that decreases the time needed to triage the issue and address the root cause or escalate.
Data store panels provide a helpful set of metrics that should be consulted first when troubleshooting this alert.
Treatment of "resource not found" errors: a small number of "not found" responses is normal because TSB, for optimisation purposes, often uses Get queries instead of Exists to determine whether a resource exists. However, a large rate of "not found" (404-like) responses likely indicates an issue with the persistent backend setup.
Consult the tsb and iam component logs to narrow down the nature of the Postgres errors. To troubleshoot further, verify the health of the Postgres installation.
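If you do choose to alert on this signal directly, a sketch might watch the error ratio of backend operations, excluding the expected "not found" responses discussed above. The metric name and labels below are entirely hypothetical; use whatever persistence-operation metric TSB exposes in your environment:

```yaml
- alert: TSBPersistentBackendErrors
  # `persistence_operation_total` and its `error` label are hypothetical
  # names for illustration only.
  expr: |
    sum(rate(persistence_operation_total{error="true"}[5m]))
      /
    sum(rate(persistence_operation_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "TSB is seeing elevated persistent backend (Postgres) error rates"
```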
MPC Component Health
MPC (Management Plane Controller) is a TSB MP component responsible for configuration and state validation and bi-directional sync between the TSB API and the XCP Central component. MPC operational issues indicate that configuration updates may not be applied, or applied too slowly.
The MPC Operational Status dashboard provides an extensive list of signals that can be helpful in troubleshooting MPC issues. To troubleshoot further, consult the mpc component logs.
Note: MPC uses the k8s API as an intermediate configuration store. Any issues affecting the k8s APIServer (notably, its etcd installation) will result in degraded MPC health. MPC logs can be helpful in confirming the health of the interaction between MPC and k8s.
XCP Operational Status
Alerts describing the health of TSB Management Plane components.
Last Management Plane Sync
The maximum time elapsed since XCP Edge last synced with the management plane (XCP Central) for each registered cluster. This indicates how stale the configuration received from the management plane is in a given cluster. A reasonable first-iteration threshold here is 300 seconds.
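If the signal is available as a last-sync timestamp, the 300-second threshold translates to a rule like the following. The metric name is an assumption; substitute the actual last-sync metric reported by XCP:

```yaml
- alert: XCPEdgeSyncStale
  # `xcp_edge_last_sync_timestamp_seconds` is a hypothetical metric name.
  expr: max by (cluster) (time() - xcp_edge_last_sync_timestamp_seconds) > 300
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Cluster {{ $labels.cluster }} has not synced with XCP Central for over 5 minutes"
```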
Use "XCP Central Operational Status" and "XCP Edge status" dashboards to
narrow down which Control Plane cluster is experiencing syncing issues.
Verify the connectivity between the CP cluster and MP.
To troubleshoot further, the xcp-central MP and edge CP component logs can provide additional insight.