Version: 0.9.x

Key Metrics

Tetrate Service Bridge collects a large number of metrics; this page details the ones we consider most important. It is generated from dashboards we run internally at Tetrate and will be updated over time based on best practices learned from operational experience at Tetrate and from user deployments. Each heading represents a different dashboard and each sub-heading is a panel on that dashboard. For this reason, you may see some metrics appear multiple times.

Global Configuration Distribution

These metrics indicate the overall health of Service Bridge and should be considered the starting point for any investigation into issues with Service Bridge.

Connected Clusters

This details all clusters that are connected to and receiving configuration from the management plane.

If this number drops below 1, or a given cluster does not appear in this table, the cluster is disconnected. This may happen for a brief period of time during upgrades/re-deploys.

  • grpc_client_msg_received_total (labels: component, grpc_type)
    PromQL: count(sum(rate(grpc_client_msg_received_total{component="tsbd", grpc_type="server_stream"}[30s])) by (cluster_name)) by (cluster_name)
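
If you want to alert on this panel, a minimal Prometheus alerting rule along the following lines works as a starting point. The expected cluster count (3 here), the for duration and the severity label are illustrative and should be adjusted to your deployment:

    groups:
    - name: tsb-global-config-distribution
      rules:
      - alert: TSBClusterDisconnected
        # Fires when fewer clusters than expected are streaming configuration from the management plane.
        expr: count(sum(rate(grpc_client_msg_received_total{component="tsbd", grpc_type="server_stream"}[30s])) by (cluster_name)) < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Fewer clusters than expected are connected to the TSB management plane."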

TSB Error Rate (Humans)

Rate of failed requests to the TSB apiserver from the UI and CLI.

  • grpc_server_handled_total (labels: component, grpc_code, grpc_method, grpc_type)
    PromQL: sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_code) OR on() vector(0)

Time Since Last Cluster Sync

Each tsbd emits the timestamp at which it last successfully synced. As such, the time displayed here can vary based on how fast the metric can get to this dashboard.

In a fast metrics pipeline, this number will hover around 15s. If a cluster's last sync time grows to a couple of minutes, it should be investigated.

Each cluster sends and receives data from the management plane. It receives the logical model of the service mesh and sends the local cluster physical model (namespaces, services, etc.).

The staler a configuration becomes the more likely it will start to impact the data plane. As such, it is critical to investigate any ongoing staleness.

  • tsbd_tsb_latest_sync_time (labels: N/A)
    PromQL: time() - min(tsbd_tsb_latest_sync_time) by (cluster_name, direction)
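
A rough sketch of an alerting rule for this panel follows; the 300-second threshold matches the "couple of minutes" guidance above but should be tuned to your metrics pipeline latency:

    groups:
    - name: tsb-cluster-sync
      rules:
      - alert: TSBClusterSyncStale
        # Fires when a cluster has not synced with the management plane for more than 5 minutes.
        expr: time() - min(tsbd_tsb_latest_sync_time) by (cluster_name, direction) > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cluster {{ $labels.cluster_name }} sync ({{ $labels.direction }}) is more than 5 minutes stale."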

Istio-Envoy Sync Time (99th Percentile)

Once tsbd has synced with the management plane, it creates resources for Istio to configure Envoy. Istio usually distributes these within a second.

If this number starts to exceed 10 seconds, it is likely that istiod needs to be scaled out. In small clusters, it is possible this number is too small to be captured by the histogram buckets and so may be nil.

  • pilot_proxy_convergence_time_bucket (labels: N/A)
    PromQL: histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le, cluster_name))
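
The 10-second guidance above translates directly into an alerting rule. This is a sketch only; the rate window and for duration are illustrative:

    groups:
    - name: istio-envoy-sync
      rules:
      - alert: IstioProxyConvergenceSlow
        # Fires when the 99th percentile Istio-Envoy sync time stays above 10 seconds.
        expr: histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket[5m])) by (le, cluster_name)) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "99th percentile Istio-Envoy sync time in {{ $labels.cluster_name }} exceeds 10s; istiod may need scaling out."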

Istiod Errors

Rate of istiod errors broken down by cluster. This graph helps identify which clusters may be experiencing problems. Typically there should be no errors; any non-transient errors should be investigated.

Sometimes this graph will show "No data" or these metrics won't exist. This is because istiod only emits these metrics when errors occur.

  • pilot_total_xds_internal_errors (labels: N/A)
  • pilot_total_xds_rejects (labels: N/A)
  • pilot_xds_expired_nonce (labels: N/A)
  • pilot_xds_push_context_errors (labels: N/A)
  • pilot_xds_pushes (labels: type)
  • pilot_xds_write_timeout (labels: N/A)

  Each row in this panel uses the same combined expression:

    PromQL: sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)

Istio Operational Status

Operational metrics to indicate istiod health.

Connected Envoys

Count of Envoys connected to istiod. This should represent the total number of endpoints in the selected cluster.

If this number significantly decreases for longer than 5 minutes without an obvious reason (e.g. a scale-down event) then you should investigate. This may indicate that Envoys have been disconnected from istiod and are unable to reconnect.

  • pilot_xds (labels: cluster_name)
    PromQL: sum(pilot_xds{cluster_name="$cluster"})
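
One way to catch a significant drop lasting more than 5 minutes is to compare the current count against a recent baseline. The 30% drop threshold and one-hour baseline below are illustrative assumptions, not recommended values:

    groups:
    - name: istio-operational-status
      rules:
      - alert: ConnectedEnvoysDropped
        # Fires when the number of Envoys connected to istiod falls well below where it was an hour ago.
        expr: sum(pilot_xds) by (cluster_name) < 0.7 * sum(pilot_xds offset 1h) by (cluster_name)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connected Envoy count in {{ $labels.cluster_name }} has dropped more than 30% compared to one hour ago."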

Total Error Rate

The total error rate for Istio when configuring Envoy, including generation and transport.

Any errors (current and historic) should be investigated using the more detailed split below.

  • pilot_total_xds_internal_errors (labels: cluster_name)
  • pilot_total_xds_rejects (labels: cluster_name)
  • pilot_xds_expired_nonce (labels: cluster_name)
  • pilot_xds_push_context_errors (labels: cluster_name)
  • pilot_xds_pushes (labels: cluster_name, type)
  • pilot_xds_write_timeout (labels: cluster_name)

  Each row in this panel uses the same combined expression:

    PromQL: sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m])) + sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) OR on() vector(0)

Median Proxy Convergence Time

The median (50th percentile) delay between configuration changes being received by istiod and a proxy receiving all required configuration in the selected cluster. This number indicates how stale the proxy configuration is. As this number increases, it may start to affect application traffic.

This number is typically in the hundreds of milliseconds. In small clusters, it is possible that this number is zero.

If this number creeps up to 30s for an extended period of time, it is likely that istiod needs to be scaled out (or up), as it is probably pinned against its CPU limits.

  • pilot_proxy_convergence_time_bucket (labels: cluster_name)
    PromQL: histogram_quantile(0.5, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))

Istiod Push Rate

The rate of istiod pushes to Envoy grouped by discovery service. Istiod will push clusters (CDS), endpoints (EDS), listeners (LDS) or routes (RDS) any time it receives some kind of configuration change. This change may have been triggered by a user interacting with TSB or it may have been triggered by a change in infrastructure such as a new endpoint (service instance/pod) being created.

In small, relatively static clusters it's possible for these values to be zero most of the time.

  • pilot_xds_pushes (labels: cluster_name, type)
    PromQL: sum(irate(pilot_xds_pushes{cluster_name="$cluster", type=~"cds|eds|rds|lds"}[1m])) by (type)

Istiod Error Rate

The different error rates for Istio during general operations as well as generation and distribution of Envoy configuration.

pilot_xds_write_timeout Rate of connection timeouts between Envoy and istiod. This number indicates that an Envoy has taken too long to acknowledge a configuration change from Istio. An increase in these errors typically indicates network issues, Envoy resource limits or istiod resource limits (usually CPU).

pilot_total_xds_internal_errors Rate of errors thrown inside istiod whilst generating Envoy configuration. Check the istiod logs for more details if you see internal errors.

pilot_total_xds_rejects Rate of configuration rejected by Envoy. Istio should never produce invalid Envoy configuration, so any errors here warrant investigation, starting with the istiod logs.

pilot_xds_expired_nonce Rate of expired nonces from Envoys. This number indicates that an Envoy has responded to the wrong request sent from Istio. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU).

pilot_xds_push_context_errors Rate of errors setting up a connection with an Envoy instance. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU). Check the istiod logs for further details.

pilot_xds_pushes Rate of transport errors sending configuration to Envoy. An increase in these errors typically indicates network issues (saturation or partition), Envoy resource limits or istiod resource limits (usually CPU).

  • pilot_total_xds_internal_errors (labels: cluster_name)
    PromQL: sum(rate(pilot_total_xds_internal_errors{cluster_name="$cluster"}[1m]))
  • pilot_total_xds_rejects (labels: cluster_name)
    PromQL: sum(rate(pilot_total_xds_rejects{cluster_name="$cluster"}[1m]))
  • pilot_xds_expired_nonce (labels: cluster_name)
    PromQL: sum(rate(pilot_xds_expired_nonce{cluster_name="$cluster"}[1m]))
  • pilot_xds_push_context_errors (labels: cluster_name)
    PromQL: sum(rate(pilot_xds_push_context_errors{cluster_name="$cluster"}[1m]))
  • pilot_xds_pushes (labels: cluster_name, type)
    PromQL: sum(rate(pilot_xds_pushes{cluster_name="$cluster", type=~".*_senderr"}[1m])) by (type)
  • pilot_xds_write_timeout (labels: cluster_name)
    PromQL: sum(rate(pilot_xds_write_timeout{cluster_name="$cluster"}[1m]))

Proxy Convergence Time

The delay between configuration changes being received by istiod and a proxy receiving all required configuration in the selected cluster. Broken down into percentiles.

This number indicates how stale the proxy configuration is. As this number increases, it may start to affect application traffic.

This number is typically in the hundreds of milliseconds. If this number creeps up to 30s for an extended period of time, it is likely that istiod needs to be scaled out (or up), as it is probably pinned against its CPU limits.

  • pilot_proxy_convergence_time_bucket (labels: cluster_name)
    PromQL: histogram_quantile(0.5, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
  • pilot_proxy_convergence_time_bucket (labels: cluster_name)
    PromQL: histogram_quantile(0.90, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
  • pilot_proxy_convergence_time_bucket (labels: cluster_name)
    PromQL: histogram_quantile(0.99, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))
  • pilot_proxy_convergence_time_bucket (labels: cluster_name)
    PromQL: histogram_quantile(0.999, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster"}[1m])) by (le))

Configuration Validation

Rate of successful and failed Istio configuration validation requests. Configuration validation is triggered here when TSB configuration is created or updated.

Any failures here should be investigated in the istiod and tsbd logs.

If TSB configuration changes that affect the selected cluster are being made and the success number is zero, then there is an issue with configuration propagation. Check the tsbd logs to debug further.

  • galley_validation_failed (labels: cluster_name)
    PromQL: sum(rate(galley_validation_failed{cluster_name="$cluster"}[1m]))
  • galley_validation_passed (labels: cluster_name)
    PromQL: sum(rate(galley_validation_passed{cluster_name="$cluster"}[1m]))
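
Since any validation failure warrants a look at the logs, a simple alerting rule can surface it. This sketch assumes the same cluster_name label used in the dashboard queries:

    groups:
    - name: istio-config-validation
      rules:
      - alert: IstioConfigValidationFailures
        # Fires when Istio rejects configuration for a sustained period; check the istiod and tsbd logs.
        expr: sum(rate(galley_validation_failed[5m])) by (cluster_name) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Istio configuration validation failures detected in {{ $labels.cluster_name }}."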

Sidecar Injection

Rate of sidecar injection requests. Sidecar injection is triggered whenever a new instance/pod is created for which you have instructed Istio to inject a sidecar. This is typically done via namespace or pod labels.

Any errors displayed here should be investigated further by checking the istiod logs.

  • sidecar_injection_failure_total (labels: cluster_name)
    PromQL: sum(rate(sidecar_injection_failure_total{cluster_name="$cluster"}[1m]))
  • sidecar_injection_success_total (labels: cluster_name)
    PromQL: sum(rate(sidecar_injection_success_total{cluster_name="$cluster"}[1m]))

MPC Operational Status

Operational metrics to indicate Management Plane Controller (MPC) health.

Time since last updates

This metric shows the time elapsed since the last updates between the Management Plane Controller (MPC) and TSB.

There are three types of updates:

  • Config update sent to XCP - These are config updates that are received from TSB and are then applied in the XCP cluster.
  • Cluster update pushed from XCP to TSB - These are updates to the cluster status pushed from XCP to TSB. These updates are received from XCP every time the status of a cluster changes (new services are deployed, some services are removed, ports in services have changed, etc).
  • Cluster update sent to XCP - This is the list of onboarded clusters being reported to XCP. TSB keeps XCP updated with the list of onboarded clusters so that config generation is only executed for those clusters that are enabled in TSB.

  • mpc_tsb_latest_sync_time (labels: N/A)
    PromQL: time() - min(mpc_tsb_latest_sync_time) by (direction, resource)

Config updates processed every 5m

This is the number of configuration updates that are received by the Management Plane Controller (MPC) to be processed and sent to XCP.

TSB sends the config updates over a gRPC stream that is permanently connected to MPC, and this metric shows the number of messages that are received and processed by MPC on that stream.

  • permanent_stream_operation (labels: error, name)
    PromQL: sum(increase(permanent_stream_operation{name="ConfigUpdates", error=""}[5m])) or on() vector(0)
  • permanent_stream_operation (labels: error, name)
    PromQL: sum(increase(permanent_stream_operation{name="ConfigUpdates", error!=""}[5m])) or on() vector(0)

Config stream connection attempts every 5m

The number of connection (and reconnection) attempts on the config updates stream.

TSB sends the config updates over a gRPC stream that is permanently connected to MPC. This metric shows the number of connections and reconnections that happened on that stream.

  • permanent_stream_connection_attempts (labels: error, name)
    PromQL: sum(increase(permanent_stream_connection_attempts{name="ConfigUpdates", error=""}[5m])) or on() vector(0)
  • permanent_stream_connection_attempts (labels: error, name)
    PromQL: sum(increase(permanent_stream_connection_attempts{name="ConfigUpdates", error!=""}[5m])) or on() vector(0)

Config updates applied every 5m

The number of config updates that have been processed and applied.

MPC constantly receives updates on config resources from TSB. This metric shows the number of config updates processed and applied to XCP.

  • mpc_xcp_config_push_count (labels: error)
    PromQL: sum(increase(mpc_xcp_config_push_count{error=""}[5m]))
  • mpc_xcp_config_push_count (labels: error)
    PromQL: sum(increase(mpc_xcp_config_push_count{error!=""}[5m])) or on() vector(0)

XCP Resource conversion rate

Configuration updates received from TSB are processed by MPC and translated into XCP resources. This metric shows the conversion rate of TSB resources to XCP resources.

This can give a good idea of the number of resources of each kind that exists in the runtime configuration.

  • mpc_xcp_conversion_count (labels: N/A)
    PromQL: sum(rate(mpc_xcp_conversion_count[1m])) by (resource)

XCP Resource conversion error rate

Configuration updates received from TSB are processed by MPC and translated into XCP resources. This metric shows the conversion error rate of TSB resources to XCP resources.

This should always be zero. If there are errors reported in this graph, it means that there are incompatibilities between the XCP resources and the TSB ones. This could be caused by mismatched versions of TSB and XCP.

  • mpc_xcp_conversion_count (labels: error)
    PromQL: sum(rate(mpc_xcp_conversion_count{error!=""}[1m])) by (resource) or on() vector(0)

TSB Cluster updates processed every 5m

This is the number of cluster updates that are received by the Management Plane Controller (MPC) to be processed and sent to XCP.

TSB sends the cluster updates (new onboarded clusters, deleted clusters) over a gRPC stream that is permanently connected to MPC, and this metric shows the number of messages that are received and processed by MPC on that stream.

  • permanent_stream_operation (labels: error, name)
    PromQL: sum(increase(permanent_stream_operation{name="ClusterPush", error=""}[5m])) or on() vector(0)
  • permanent_stream_operation (labels: error, name)
    PromQL: sum(increase(permanent_stream_operation{name="ClusterPush", error!=""}[5m])) or on() vector(0)

TSB Cluster stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster updates stream.

TSB sends the cluster updates over a gRPC stream that is permanently connected to MPC. This metric shows the number of connections and reconnections that happened on that stream.

  • permanent_stream_connection_attempts (labels: error, name)
    PromQL: sum(increase(permanent_stream_connection_attempts{name="ClusterPush", error=""}[5m])) or on() vector(0)
  • permanent_stream_connection_attempts (labels: error, name)
    PromQL: sum(increase(permanent_stream_connection_attempts{name="ClusterPush", error!=""}[5m])) or on() vector(0)

TSB Cluster updates applied every 5m

The number of cluster updates that have been processed and applied.

MPC constantly receives updates on the onboarded clusters from TSB. This metric shows the number of cluster updates processed and applied to XCP.

  • mpc_xcp_cluster_push_count (labels: error)
    PromQL: sum(increase(mpc_xcp_cluster_push_count{error=""}[5m]))
  • mpc_xcp_cluster_push_count (labels: error)
    PromQL: sum(increase(mpc_xcp_cluster_push_count{error!=""}[5m])) or on() vector(0)

XCP cluster status updates processed every 5m

This is the number of cluster status updates that are processed by the Management Plane Controller (MPC) to be sent to TSB.

MPC sends the cluster status updates over a gRPC stream that is permanently connected to TSB, and this metric shows the number of cluster updates that are processed by MPC and sent to TSB on that stream.

  • permanent_stream_operation (labels: error, name)
    PromQL: sum(increase(permanent_stream_operation{name="ClusterUpdates", error=""}[5m])) or on() vector(0)
  • permanent_stream_operation (labels: error, name)
    PromQL: sum(increase(permanent_stream_operation{name="ClusterUpdates", error!=""}[5m])) or on() vector(0)

XCP cluster status updates stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster status updates stream.

MPC sends the cluster status updates over a gRPC stream that is permanently connected to TSB. This metric shows the number of connections and reconnections that happened on that stream.

  • permanent_stream_connection_attempts (labels: error, name)
    PromQL: sum(increase(permanent_stream_connection_attempts{name="ClusterUpdates", error=""}[5m])) or on() vector(0)
  • permanent_stream_connection_attempts (labels: error, name)
    PromQL: sum(increase(permanent_stream_connection_attempts{name="ClusterUpdates", error!=""}[5m])) or on() vector(0)

OAP Operational Status

Operational metrics to indicate Tetrate Service Bridge OAP stack health.

OAP Request Rate

The request rate to OAP, by status.

  • envoy_cluster_upstream_rq_xx (labels: envoy_cluster_name, plane)
    PromQL: sum by (envoy_response_code_class) (rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name="oap-grpc", plane="management"}[1m]))
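
To turn this panel into an alert you can compute the share of 5xx responses from the same metric. This sketch assumes envoy_response_code_class="5" identifies server errors and uses an arbitrary 5% threshold:

    groups:
    - name: oap-operational-status
      rules:
      - alert: OAPHighErrorRate
        # Fires when more than 5% of requests to the OAP gRPC cluster return a 5xx-class response.
        expr: sum(rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name="oap-grpc", plane="management", envoy_response_code_class="5"}[5m])) / sum(rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name="oap-grpc", plane="management"}[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of requests to OAP are failing."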

OAP Request Latency

OAP request latency percentiles.

  • envoy_cluster_upstream_rq_time_bucket (labels: envoy_cluster_name, plane)
    PromQL: histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
  • envoy_cluster_upstream_rq_time_bucket (labels: envoy_cluster_name, plane)
    PromQL: histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
  • envoy_cluster_upstream_rq_time_bucket (labels: envoy_cluster_name, plane)
    PromQL: histogram_quantile(0.90, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
  • envoy_cluster_upstream_rq_time_bucket (labels: envoy_cluster_name, plane)
    PromQL: histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
  • envoy_cluster_upstream_rq_time_bucket (labels: envoy_cluster_name, plane)
    PromQL: histogram_quantile(0.50, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))

OAP Aggregation Request Rate

OAP Aggregation Request Rate, by type:

  • central aggregation service handler received
  • central application aggregation received
  • central service aggregation received

  • central_aggregation_handler (labels: N/A)
    PromQL: sum(rate(central_aggregation_handler[1m]))
  • central_app_aggregation (labels: N/A)
    PromQL: sum(rate(central_app_aggregation[1m]))
  • central_service_aggregation (labels: N/A)
    PromQL: sum(rate(central_service_aggregation[1m]))

OAP Aggregation Rows

Cumulative rate of rows in OAP aggregation.

  • metrics_aggregation (labels: plane)
    PromQL: sum(rate(metrics_aggregation{plane="management"}[1m]))

OAP Mesh Analysis Latency

The processing latency of the OAP service mesh telemetry streaming process.

  • mesh_analysis_latency_bucket (labels: component, plane)
    PromQL: histogram_quantile(0.99, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
  • mesh_analysis_latency_bucket (labels: component, plane)
    PromQL: histogram_quantile(0.95, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
  • mesh_analysis_latency_bucket (labels: component, plane)
    PromQL: histogram_quantile(0.90, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
  • mesh_analysis_latency_bucket (labels: component, plane)
    PromQL: histogram_quantile(0.75, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))

JVM Threads

Number of threads in the OAP JVM.

  • jvm_threads_current (labels: component, plane)
    PromQL: sum(jvm_threads_current{component="oap", plane="management"})
  • jvm_threads_daemon (labels: component, plane)
    PromQL: sum(jvm_threads_daemon{component="oap", plane="management"})
  • jvm_threads_deadlocked (labels: component, plane)
    PromQL: sum(jvm_threads_deadlocked{component="oap", plane="management"})
  • jvm_threads_peak (labels: component, plane)
    PromQL: sum(jvm_threads_peak{component="oap", plane="management"})

JVM Memory

JVM Memory stats of OAP JVM instances.

  • jvm_memory_bytes_max (labels: component, plane)
    PromQL: sum by (area, instance) (jvm_memory_bytes_max{component="oap", plane="management"})
  • jvm_memory_bytes_used (labels: component, plane)
    PromQL: sum by (area, instance) (jvm_memory_bytes_used{component="oap", plane="management"})

TSB Operational Status

Operational metrics to indicate Tetrate Service Bridge API server health.

AuthZ Success Rate

Rate of successful requests to the AuthZ server. This includes all user and cluster requests into the management plane.

Note: This indicates the health of the AuthZ server not whether the user or cluster making the request has the correct permissions to do so.

  • envoy_cluster_internal_upstream_rq (labels: envoy_response_code)
    PromQL: sum(rate(envoy_cluster_internal_upstream_rq{envoy_response_code=~"2.*"}[1m])) by (envoy_cluster_name)

AuthZ Error Rate

The error rate of requests to the AuthZ server. This includes all user and cluster requests into the management plane.

Note: This indicates the health of the AuthZ server not whether the user or cluster making the request has the correct permissions to do so.

  • envoy_cluster_internal_upstream_rq (labels: envoy_response_code)
    PromQL: sum(rate(envoy_cluster_internal_upstream_rq{envoy_response_code!~"2.*"}[1m])) by (envoy_cluster_name)
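
Rather than alerting on the raw error rate, it is often more useful to alert on the error ratio. A sketch, with an illustrative 5% threshold:

    groups:
    - name: tsb-authz
      rules:
      - alert: AuthZHighErrorRate
        # Fires when more than 5% of AuthZ requests return a non-2xx response.
        expr: sum(rate(envoy_cluster_internal_upstream_rq{envoy_response_code!~"2.*"}[5m])) by (envoy_cluster_name) / sum(rate(envoy_cluster_internal_upstream_rq[5m])) by (envoy_cluster_name) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AuthZ error ratio for {{ $labels.envoy_cluster_name }} exceeds 5%."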

AuthZ Latency

AuthZ request latency percentiles.

  • envoy_cluster_internal_upstream_rq_time_bucket (labels: N/A)
    PromQL: histogram_quantile(0.99, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket[1m])) by (le, envoy_cluster_name))
  • envoy_cluster_internal_upstream_rq_time_bucket (labels: N/A)
    PromQL: histogram_quantile(0.95, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket[1m])) by (le, envoy_cluster_name))

TSB Success Rate

Rate of successful requests to the TSB apiserver from the UI and CLI.

  • grpc_server_handled_total (labels: component, grpc_code, grpc_method, grpc_type)
    PromQL: sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_method)

TSB Error Rate

Rate of failed requests to the TSB apiserver from the UI and CLI.

  • grpc_server_handled_total (labels: component, grpc_code, grpc_method, grpc_type)
    PromQL: sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_code)

Data Store Success Rate

Successful request rate for operations persisting data to the datastore grouped by method and kind.

This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

  • persistence_operation (labels: error)
    PromQL: sum(rate(persistence_operation{error=""}[1m])) by (kind, method)
  • persistence_transaction (labels: error)
    PromQL: sum(rate(persistence_transaction{error=""}[1m]))

Data Store Latency

The request latency for operations persisting data to the datastore grouped by method.

This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

  • persistence_operation_duration_bucket (labels: N/A)
    PromQL: histogram_quantile(0.99, sum(rate(persistence_operation_duration_bucket[1m])) by (le, method))
  • persistence_transaction_duration_bucket (labels: N/A)
    PromQL: histogram_quantile(0.99, sum(rate(persistence_transaction_duration_bucket[1m])) by (le))

Data Store Error Rate

The request error rate for operations persisting data to the datastore grouped by method and kind.

NB: the graph explicitly excludes "resource not found" errors. Some level of "not found" responses is normal, as TSB often uses Get queries instead of Exists checks to determine whether a resource exists, as an optimisation.

This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

  • persistence_operation (labels: error)
    PromQL: sum(rate(persistence_operation{error!="", error!="resource not found"}[1m])) by (kind, method)
  • persistence_transaction (labels: error)
    PromQL: sum(rate(persistence_transaction{error!=""}[1m]))
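
A minimal alerting rule for sustained datastore errors, reusing the same exclusion of "resource not found"; the for duration is an assumption:

    groups:
    - name: tsb-datastore
      rules:
      - alert: TSBDataStoreErrors
        # Fires on sustained persistence errors, excluding the benign "resource not found" responses described above.
        expr: sum(rate(persistence_operation{error!="", error!="resource not found"}[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "The TSB API server is reporting datastore errors."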

Active Transactions

Number of running transactions on the datastore.

This graph shows how many active transactions are running at a given point in time. It can be useful to understand the load of the system generated by concurrent access to the platform.

  • persistence_concurrent_transaction (labels: N/A)
    PromQL: sum(persistence_concurrent_transaction)

Dual-Write Operations Request Rate

The request rate for operations persisting data to the Q Graph or Persistent Data Store via the dual-write framework. Dual-writes ensure zero-downtime data model migrations.

This graph consists of total request rate grouped by the write stage (primary/secondary) as well as error rate grouped by stage/error code.

  • primary writes are always executed synchronously, and any failure in a primary write will also manifest as an API error.
  • secondary writes are done in the background and do not manifest in direct API errors. Failures are allowed here, and the data reconcile process will fix any inconsistencies between the primary and secondary models.

  • dualop_operation (labels: stage)
    PromQL: sum(rate(dualop_operation{stage!=""}[1m])) by (stage)
  • dualop_operation (labels: error, stage)
    PromQL: sum(rate(dualop_operation{stage!="", error!=""}[1m])) by (stage, error)

Dual-Write Operations Latency

The request latency for operations persisting data to the Q Graph or Persistent Data Store via the dual-write framework. Dual-writes ensure zero-downtime data model migrations.

  • primary writes are always executed synchronously, and any failure in a primary write will also manifest as an API error.
  • secondary writes are done in the background and do not manifest in direct API errors. Failures are allowed here, and the data reconcile process will fix any inconsistencies between the primary and secondary models.

  • dualop_operation_duration_bucket (labels: stage)
    PromQL: histogram_quantile(0.99, sum(rate(dualop_operation_duration_bucket{stage!=""}[1m])) by (le, stage))
  • dualop_operation_duration_bucket (labels: stage)
    PromQL: histogram_quantile(0.95, sum(rate(dualop_operation_duration_bucket{stage!=""}[1m])) by (le, stage))
  • dualop_operation_duration_bucket (labels: stage)
    PromQL: histogram_quantile(0.90, sum(rate(dualop_operation_duration_bucket{stage!=""}[1m])) by (le, stage))
  • dualop_operation_duration_bucket (labels: stage)
    PromQL: histogram_quantile(0.75, sum(rate(dualop_operation_duration_bucket{stage!=""}[1m])) by (le, stage))

PDP Success Rate

Successful request rate of PDP grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

A drop in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads and this is usually a consequence of failures in the PIP. Failures in the PIP for write operations will result in the graph not being properly updated to the latest status and that could result in access decisions based on stale models.

  • ngac_pdp_operation (labels: error)
    PromQL: sum(rate(ngac_pdp_operation{error=""}[1m])) by (method)

PDP Error Rate

Rate of errors for PDP requests grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

Failed requests to the PDP show the number of requests from the PEP to the PDP that have failed. They do not represent "access denied" decisions; they just represent the access decision requests for which a verdict could not be obtained.

A rise in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, and this is usually a consequence of failures in the PIP. Failures in the PIP for write operations will result in the graph not being properly updated to the latest status, and that could result in access decisions based on stale models.

  • ngac_pdp_operation (labels: error)
    PromQL: sum(rate(ngac_pdp_operation{error!=""}[1m])) by (method)

PDP Latency

PDP latency percentiles grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

This metric shows the time it takes to get an access decision for authorization requests.

Degradation in PDP operations may result in general degradation of the system. PDP latency represents the time it takes to make access decisions, and that will impact user experience, since access decisions are made and enforced for every operation.

  • ngac_pdp_operation_duration_bucket (labels: N/A)
    PromQL: histogram_quantile(0.99, sum(rate(ngac_pdp_operation_duration_bucket[1m])) by (method, le))
  • ngac_pdp_operation_duration_bucket (labels: N/A)
    PromQL: histogram_quantile(0.95, sum(rate(ngac_pdp_operation_duration_bucket[1m])) by (method, le))

PIP Success Rate

Successful request rate of PIP grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

PIP operations are executed against the NGAC graph to represent and maintain the objects in the system and the relationships between them.

A drop in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, or that it is failing to persist data. Read failures may result in failed access decisions (in the PDP) and user interaction with the system may be rejected as well. Failures in write operations will result in the graph not being properly updated to the latest status and that could result in access decisions based on stale models.

  • ngac_pip_operation (labels: error)
    PromQL: sum(rate(ngac_pip_operation{error=""}[1m])) by (method)

PIP Latency

PIP latency percentiles grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

This metric shows the time it takes for a PIP operation to complete and, in the case of write operations, to have data persisted in the NGAC graph.

Degradation in PIP operations may result in general degradation of the system. PIP latency represents the time it takes to access the NGAC graph, and this directly affects the PDP when running access decisions. A degraded PIP may result in a degraded PDP, and that will impact user experience, as access decisions are made and enforced for every operation.

  • ngac_pip_operation_duration_bucket (labels: N/A)
    PromQL: histogram_quantile(0.99, sum(rate(ngac_pip_operation_duration_bucket[1m])) by (method, le))
  • ngac_pip_operation_duration_bucket (labels: N/A)
    PromQL: histogram_quantile(0.95, sum(rate(ngac_pip_operation_duration_bucket[1m])) by (method, le))

PIP Error Rate

Rate of errors for PIP requests grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

PIP operations are executed against the NGAC graph to represent and maintain the objects in the system and the relationships between them.

Note: the "Node not found" errors are explicitly excluded as TSB often uses GetNode method instead of Exists to determine the node existence, for the purposes of optimisation.

A general rise in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, or that it is failing to persist data. Read failures may result in failed access decisions (in the PDP), and user interaction with the system may be rejected as well. Failures in write operations will result in the graph not being properly updated to the latest status, and that could result in access decisions based on stale models.

  • ngac_pip_operation (labels: error)
    PromQL: sum(rate(ngac_pip_operation{error!="", error!="Node not found"}[1m])) by (method)

Active PIP Transactions

Number of running transactions on the NGAC PIP.

NGAC is a graph-based authorization framework that consists of three main components:

  • Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
  • Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
  • Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

This metric shows the number of active write operations against the NGAC graph. It can be useful to understand the load of the system generated by concurrent access to the platform.

  • ngac_pip_concurrent_transaction (labels: N/A)
    PromQL: sum(ngac_pip_concurrent_transaction)

tsbd Operational Status

Operational metrics to indicate tsbd health.

Istio Resource Count

The number of Istio resources generated by tsbd. This should change when TSB configuration changes or when cluster state changes (new pods, services, etc).

If this number falls without any explanation it should be investigated.

  • tsbd_istioresources_kubeapply_count (labels: cluster_name)
    PromQL: sum(rate(tsbd_istioresources_kubeapply_count{cluster_name="$cluster"}[1m])) by (type)

Kubernetes Events Received

The events received from Kubernetes. tsbd watches for events in order to update the management plane on its local state.

  • pilot_k8s_reg_events (labels: cluster_name)
    PromQL: sum(rate(pilot_k8s_reg_events{cluster_name="$cluster"}[1m])) by (event, type)

Last Management Plane Sync

The time elapsed since tsbd last synced with the management plane. The higher this number, the more stale the configuration within the cluster. tsbd can operate without configuration updates, but over time this will start impacting data plane (application) traffic. Any values > 90s should be investigated.

Inbound is configuration that tsbd receives about the logical (data) model and other clusters from the management plane.

Outbound is configuration that tsbd sends about the physical model (namespaces, services, etc) to the management plane.

  • tsbd_tsb_latest_sync_time (labels: cluster_name)
    PromQL: time() - min(tsbd_tsb_latest_sync_time{cluster_name="$cluster"}) by (direction, resource_type)

Configuration Received

Configuration messages received from the management plane. For a more detailed explanation see the above Management plane sync section.

This should remain constant at 0.2 requests per second (except during updates), as tsbd receives configuration every 5 seconds.

  • grpc_client_msg_received_total (labels: cluster_name)
    PromQL: sum(rate(grpc_client_msg_received_total{cluster_name="$cluster"}[1m])) by (grpc_method)

Configuration Sent

Configuration messages sent to the management plane. For a more detailed explanation see the above Management plane sync section.

This should remain constant at 0.2 requests per second (except during updates), as tsbd sends configuration every 5 seconds.

  • grpc_client_msg_sent_total (labels: cluster_name, grpc_method)
    PromQL: sum(rate(grpc_client_msg_sent_total{cluster_name="$cluster", grpc_method="UpdateClusterResources"}[1m])) by (grpc_method)
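
Because this rate is expected to stay near-constant, a drop to zero is a useful signal. A sketch follows; note that if the stream disappears entirely the series may go absent, so you may also want an absent()-based rule:

    groups:
    - name: tsbd-operational-status
      rules:
      - alert: TsbdConfigSendStalled
        # Fires when tsbd stops sending cluster resources to the management plane (expected rate is roughly 0.2 req/s).
        expr: sum(rate(grpc_client_msg_sent_total{grpc_method="UpdateClusterResources"}[5m])) by (cluster_name) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "tsbd in {{ $labels.cluster_name }} has stopped sending configuration to the management plane."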

Zipkin Operational status

Operational metrics to indicate Tetrate Service Bridge Zipkin stack health.

Requests per second

Rate of HTTP requests to Zipkin by method, URL and response code.

  • http_server_requests_seconds_count (labels: component, plane)
    PromQL: sum by (method, uri, status) (rate(http_server_requests_seconds_count{component="zipkin", plane="management"}[1m]))

Requests latency

Latency of HTTP requests to Zipkin.

  • http_server_requests_seconds_bucket (labels: component, plane)
    PromQL: histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{component="zipkin", plane="management"}[1m])) by (le))
  • http_server_requests_seconds_bucket (labels: component, plane)
    PromQL: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket{component="zipkin", plane="management"}[1m])) by (le))
  • http_server_requests_seconds_bucket (labels: component, plane)
    PromQL: histogram_quantile(0.75, sum(rate(http_server_requests_seconds_bucket{component="zipkin", plane="management"}[1m])) by (le))
  • http_server_requests_seconds_bucket (labels: component, plane)
    PromQL: histogram_quantile(0.50, sum(rate(http_server_requests_seconds_bucket{component="zipkin", plane="management"}[1m])) by (le))

Dropped messages/spans

The rate of messages and spans dropped by Zipkin. Note: a span could be dropped if it's a duplicate.

  • zipkin_collector_messages_dropped_total (labels: plane)
    PromQL: sum(rate(zipkin_collector_messages_dropped_total{plane="management"}[5m]))
  • zipkin_collector_spans_dropped_total (labels: plane)
    PromQL: sum(rate(zipkin_collector_spans_dropped_total{plane="management"}[5m]))
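
If you want to be notified about sustained span loss, a rule along these lines can help; the 15-minute for duration is an arbitrary buffer to avoid paging on occasional duplicate drops:

    groups:
    - name: zipkin-operational-status
      rules:
      - alert: ZipkinDroppingSpans
        # Fires when the Zipkin collector keeps dropping spans for a sustained period.
        expr: sum(rate(zipkin_collector_spans_dropped_total{plane="management"}[5m])) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Zipkin is dropping spans in the management plane."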

Elasticsearch requests

The rate of Zipkin requests to Elasticsearch backend, by method and result.

  • elasticsearch_requests_total (labels: component, plane)
    PromQL: sum by (method, result) (rate(elasticsearch_requests_total{component="zipkin", plane="management"}[1m]))

Zipkin Collector Throughput

Cumulative spans and messages read by the Zipkin collector; this relates to the messages reported by instrumented apps.

  • zipkin_collector_message_spans (labels: plane)
    PromQL: sum(zipkin_collector_message_spans{plane="management"})
  • zipkin_collector_spans_total (labels: plane)
    PromQL: sum(rate(zipkin_collector_spans_total{plane="management"}[5m]))

Zipkin Bytes in Message

The size of the last message received by the Zipkin collector.

  • zipkin_collector_message_bytes (labels: plane)
    PromQL: sum(zipkin_collector_message_bytes{plane="management"})

Zipkin bytes/sec

Cumulative rate of data received by Zipkin; should relate to messages reported by instrumented apps.

  • zipkin_collector_bytes_total (labels: plane)
    PromQL: sum(rate(zipkin_collector_bytes_total{plane="management"}[5m]))

Zipkin Spans in Message

The number of spans in the last message received by the Zipkin collector.

  • zipkin_collector_message_spans (labels: plane)
    PromQL: sum(zipkin_collector_message_spans{plane="management"})

Threads

The number of threads in Zipkin by status.

  • jvm_threads_daemon_threads (labels: component, plane)
    PromQL: sum(jvm_threads_daemon_threads{component="zipkin", plane="management"})
  • jvm_threads_live_threads (labels: component, plane)
    PromQL: sum(jvm_threads_live_threads{component="zipkin", plane="management"})
  • jvm_threads_peak_threads (labels: component, plane)
    PromQL: sum(jvm_threads_peak_threads{component="zipkin", plane="management"})
  • jvm_threads_states_threads (labels: component, plane)
    PromQL: jvm_threads_states_threads{component="zipkin", plane="management"}

Garbage Collection

Max GC Pause on Zipkin by cause.

  • jvm_gc_pause_seconds_max (labels: component, plane)
    PromQL: sum by (cause) (jvm_gc_pause_seconds_max{component="zipkin", plane="management"})

JVM Classes

The number of classes that are currently loaded in the Zipkin JVM.

  • jvm_classes_loaded_classes (labels: component, plane)
    PromQL: sum(jvm_classes_loaded_classes{component="zipkin", plane="management"})
  • jvm_classes_unloaded_classes_total (labels: component, plane)
    PromQL: sum(jvm_classes_unloaded_classes_total{component="zipkin", plane="management"})

JVM Memory

JVM Memory stats for Zipkin instance.

  • jvm_buffer_total_capacity_bytes (labels: component, plane)
    PromQL: sum by (id, instance) (jvm_buffer_total_capacity_bytes{component="zipkin", plane="management"})
  • jvm_memory_max_bytes (labels: component, plane)
    PromQL: sum by (area, instance) (jvm_memory_max_bytes{component="zipkin", plane="management"})