Tetrate Service Bridge Version: 1.14.x

Key Metrics

Tetrate Service Bridge collects a large number of metrics. This page is generated from dashboards run internally at Tetrate and will be updated periodically based on best practices learned from operational experience at Tetrate and from user deployments. Each heading represents a different dashboard, and each sub-heading is a panel on that dashboard. For this reason, you may see metrics appear multiple times.

The metrics in this document refer to TSB components, so be sure to check the TSB architecture to get a good understanding of each component and its function.

The list of available dashboards can be obtained with the tctl experimental grafana dashboard command. To download a dashboard in JSON format to upload it to Grafana, you can run the command as follows:

tctl experimental grafana dashboard <dashboard file name> -o json

You can also upload the dashboards directly to your Grafana instance using the tctl experimental grafana upload command.

Dashboard 1

Control Plane Operator metrics

Dashboard to show the status of control plane operator metrics.

Control Plane Mode

This panel represents the control plane mode configured in the control plane cluster.

Possible values are:

CONTROL (Default)
OBSERVE

Metric Name	Labels	PromQL Expression
`control_plane_mode`	N/A	control_plane_mode == 1

Closest Token To Expire

The 10 clusters that have a token that will expire the soonest.
Tokens are rotated halfway through their validity period. By default, tokens are valid for one hour, so they are rotated 30 minutes after creation. If a cluster's time is displayed in red or with a negative value, at least one token has expired without being rotated. In such cases, review the "Token Rotation Execution Failed" chart to check for any failed token rotation attempts.

Metric Name	Labels	PromQL Expression
`token_expiration_timestamp`	`cluster_name`	max(max_over_time(token_expiration_timestamp{cluster_name=~"${cluster}"}[1m])) by (cluster_name) - time()
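
The timing rule above can be illustrated with a small sketch. It assumes the default one-hour validity mentioned in the text; the function names and the sample timestamps are hypothetical and are not part of TSB.

```python
from datetime import datetime, timedelta, timezone


def rotation_time(created_at: datetime, validity: timedelta) -> datetime:
    """Return when a token is due for rotation: halfway through its validity."""
    return created_at + validity / 2


def seconds_until_expiry(expires_at: datetime, now: datetime) -> float:
    """Mirror the panel arithmetic: expiration timestamp minus the current time."""
    return (expires_at - now).total_seconds()


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical creation time
validity = timedelta(hours=1)  # default validity stated in the text
expires = created + validity

print(rotation_time(created, validity))  # 2024-01-01 12:30:00+00:00
print(seconds_until_expiry(expires, created + timedelta(minutes=45)))  # 900.0
print(seconds_until_expiry(expires, created + timedelta(minutes=70)))  # -600.0
```

A negative result, as in the last line, corresponds to the red or negative value described for this panel.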

Clusters Not Validating Tokens

Each cluster should validate all its tokens every 60 seconds. If a cluster shows up here, more than 120 seconds have passed since it last validated its tokens and reported the result.

This doesn't necessarily mean that any token is invalid or is not being rotated properly. For example, the metrics may not have been reported correctly or in a timely manner, or there may be a problem with the TSB Control Plane operator.

Metric Name	Labels	PromQL Expression
`last_tokens_validation_timestamp`	`cluster_name`	max(time() - max_over_time(last_tokens_validation_timestamp{cluster_name=~"${cluster}"}[1m])) by (cluster_name)
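
As a rough illustration of the staleness check described above, the sketch below computes the age of each cluster's last validation and keeps those older than 120 seconds. It assumes the threshold is applied to the age that the expression returns; the cluster names and timestamps are made up.

```python
STALE_AFTER_SECONDS = 120  # threshold quoted in the panel description


def stale_clusters(last_validation: dict[str, float], now: float) -> dict[str, float]:
    """Return clusters whose last validation is older than the threshold, with its age."""
    ages = {name: now - ts for name, ts in last_validation.items()}
    return {name: age for name, age in ages.items() if age > STALE_AFTER_SECONDS}


now = 1_700_000_000.0  # hypothetical Unix time
samples = {"cluster-a": now - 45.0, "cluster-b": now - 300.0}  # hypothetical clusters
print(stale_clusters(samples, now))  # {'cluster-b': 300.0}
```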

Valid Tokens

Number of valid control plane tokens.

Metric Name	Labels	PromQL Expression
`valid_tokens`	`cluster_name`	max(max_over_time(valid_tokens{cluster_name=~"${cluster}"}[1m])) by (cluster_name)

Token Rotation Executions

Number of token rotation executions. An execution doesn't necessarily mean that a token has been rotated, as the tokens might still be valid.

Metric Name	Labels	PromQL Expression
`token_rotation_executions_count_total`	`cluster_name`	sum(increase(token_rotation_executions_count_total{cluster_name=~"${cluster}"}[1m])) by(cluster_name)

Token Rotation Execution Failed

Number of failed token rotation executions. If there are errors in this chart it means that the token could not be rotated. Please check the TSB Control Plane operator logs to find the cause. Use the following command: kubectl logs -n istio-system -l name=tsb-operator --tail=-1 | grep "token rotation failed, retrying in"

Metric Name	Labels	PromQL Expression
`token_rotation_executions_count_total`	`cluster_name` `status`	sum(increase(token_rotation_executions_count_total{cluster_name=~"${cluster}", status="failed"}[5m])) by(cluster_name)

Tokens Exceeded Rotation Time

Number of control plane tokens that exceeded their rotation time.

Metric Name	Labels	PromQL Expression
`token_exceeded_rotation_time`	`cluster_name`	max(max_over_time(token_exceeded_rotation_time{cluster_name=~"${cluster}"}[1m])) by (cluster_name)

Tokens Rotated Successfully.

Number of tokens that have been rotated successfully. If the values are 0 it might mean that the token rotation execution is failing or that TSB Control Plane operator is not even running.

Metric Name	Labels	PromQL Expression
`token_rotations_count_total`	`cluster_name` `status`	sum(increase(token_rotations_count_total{cluster_name=~"${cluster}", status="success"}[1m])) by(cluster_name)

Failed Tokens to Rotate.

Number of tokens that could not be rotated.

Metric Name	Labels	PromQL Expression
`token_rotations_count_total`	`cluster_name` `status`	sum(increase(token_rotations_count_total{cluster_name=~"${cluster}", status="failed"}[1m])) by(cluster_name)

Tokens Exceeded Rotation Timeline

Number of control plane tokens that exceeded their rotation time.

Metric Name	Labels	PromQL Expression
`token_exceeded_rotation_time`	`cluster_name`	max(max_over_time(token_exceeded_rotation_time{cluster_name=~"${cluster}"}[1m])) by (cluster_name)

Valid Tokens Timeline

Number of valid control plane tokens.

Metric Name	Labels	PromQL Expression
`valid_tokens`	`cluster_name`	sum(valid_tokens{cluster_name=~"${cluster}"}) by (cluster_name)

Invalid Tokens

Number of invalid control plane tokens grouped by the reason.

Metric Name	Labels	PromQL Expression
`invalid_tokens`	`cluster_name`	sum by (reason) (invalid_tokens{cluster_name=~"$cluster"})
`token_missing_rotate_at_annotation`	`cluster_name`	sum by (name) (token_missing_rotate_at_annotation{cluster_name=~"$cluster"})

Embedded Postgres

Postgres Scrape Status

Shows the status of the metrics collection process.

Metric Name	Labels	PromQL Expression
`pg_exporter_last_scrape_error`	N/A	sum(pg_exporter_last_scrape_error)

Postgres Scrape Status

Shows the status of the metrics collection process.

Metric Name	Labels	PromQL Expression
`pg_exporter_last_scrape_error`	N/A	max(pg_exporter_last_scrape_error) by (role)

Postgres UP

Metric Name	Labels	PromQL Expression
`pg_up`	N/A	max by(role) (pg_up)

Kubegres Reconciliation Health

Monitors the health of the Kubegres controller by tracking reconciliation success rate, errors, and total reconciliations. This is the most critical dashboard for understanding if the operator is functioning correctly.

Metric Name	Labels	PromQL Expression
`controller_runtime_reconcile_total`	`component`	sum by(result) (rate(controller_runtime_reconcile_total{component="kubegres"}[1m]))

Kubegres Reconciliation Latency P95

Tracks how long reconciliation loops take to complete. High latency may indicate issues with resources, API server slowness, or complex state changes. Use this to identify performance degradation.

Metric Name	Labels	PromQL Expression
`controller_runtime_reconcile_time_seconds_bucket`	`component`	sum by(le) (histogram_quantile(0.95, rate(controller_runtime_reconcile_time_seconds_bucket{component="kubegres"}[1m])))

Kubegres Work Queue Status

Monitors the controller's work queue to detect backpressure, retries, and processing delays. A growing queue depth or high retry count indicates the controller is struggling to keep up with changes.

Metric Name	Labels	PromQL Expression
`workqueue_depth`	`component`	max by(cluster_name) (workqueue_depth{component="kubegres"})
`workqueue_retries_total`	`component`	max by(cluster_name) (rate(workqueue_retries_total{component="kubegres"}[1m]))

Current Replication Lag

Metric Name	Labels	PromQL Expression
`pg_replication_lag_seconds`	N/A	max by(role) (pg_replication_lag_seconds)

Max Replication Lag [s]

Metric Name	Labels	PromQL Expression
`pg_replication_lag_seconds`	N/A	max by(role) (pg_replication_lag_seconds)

Active Replication Slots

Shows the number of active replication slots per cluster. Having an inactive replication slot for an extended period can cause WAL files to accumulate and fill up the database filesystem. Expect the number of replication slots to equal the number of running replicas. If replication slots are disabled, expect No Data.

Metric Name	Labels	PromQL Expression
`pg_replication_slots_active`	N/A	count by (cluster_name) (pg_replication_slots_active==1)

Inactive Replication Slots

Shows the number of inactive replication slots per cluster. Having an inactive replication slot for an extended period can cause WAL files to accumulate and fill up the database filesystem. Expect No Data in the common scenario, but expect some entries for a short time when replicas restart or are scaled up or down.

Metric Name	Labels	PromQL Expression
`pg_replication_slots_active`	N/A	count by (cluster_name) (pg_replication_slots_active==0)

Connections used

Percentage of max_connections used

Metric Name	Labels	PromQL Expression
`pg_settings_max_connections`	N/A	sum(pg_stat_database_numbackends)/max(pg_settings_max_connections)
`pg_stat_database_numbackends`	N/A	sum(pg_stat_database_numbackends)/max(pg_settings_max_connections)

Connections used

Metric Name	Labels	PromQL Expression
`pg_settings_max_connections`	N/A	100*sum(pg_stat_database_numbackends)/max(pg_settings_max_connections)
`pg_stat_database_numbackends`	N/A	100*sum(pg_stat_database_numbackends)/max(pg_settings_max_connections)
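
The percentage form of this expression is plain arithmetic, sketched below. The function name and the sample values are hypothetical; only the formula comes from the table above.

```python
def connections_used_percent(backends_per_db: list[int], max_connections: int) -> float:
    """100 * sum(pg_stat_database_numbackends) / max(pg_settings_max_connections)."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100 * sum(backends_per_db) / max_connections


# Three databases with 12, 3, and 10 backends against a limit of 100 connections.
print(connections_used_percent([12, 3, 10], 100))  # 25.0
```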

Failed WAL archiving attempts

Should be 0

Source: pg_stat_database

With log_lock_waits turned on, deadlocks will be logged to the PostgreSQL Logfiles.

Metric Name	Labels	PromQL Expression
`pg_stat_archiver_failed_count_total`	`role`	max by(role) (pg_stat_archiver_failed_count_total{role="primary"})

DBs Size

Metric Name	Labels	PromQL Expression
`pg_database_size_bytes`	`role`	max by(datname) (pg_database_size_bytes{role="primary"})

Table Size

Metric Name	Labels	PromQL Expression
`pg_stat_user_tables_index_size_bytes`	`role`	max((pg_stat_user_tables_table_size_bytes{role="primary"}+pg_stat_user_tables_index_size_bytes{role="primary"}) or pg_stat_user_tables_size_bytes{role="primary"}) by (relname)
`pg_stat_user_tables_size_bytes`	`role`	max((pg_stat_user_tables_table_size_bytes{role="primary"}+pg_stat_user_tables_index_size_bytes{role="primary"}) or pg_stat_user_tables_size_bytes{role="primary"}) by (relname)
`pg_stat_user_tables_table_size_bytes`	`role`	max((pg_stat_user_tables_table_size_bytes{role="primary"}+pg_stat_user_tables_index_size_bytes{role="primary"}) or pg_stat_user_tables_size_bytes{role="primary"}) by (relname)

Checkpoint sync time

Total amount of time that has been spent in the portion of checkpoint processing where files are synchronized to disk, in milliseconds

Metric Name	Labels	PromQL Expression
`pg_stat_bgwriter_checkpoint_sync_time_total`	`role`	max(rate(pg_stat_bgwriter_checkpoint_sync_time_total{role="primary"}[1m])) by(role, index)

WAL segments size

Total size of WAL segments

Metric Name	Labels	PromQL Expression
`pg_wal_size_bytes`	`role`	max(pg_wal_size_bytes{role="primary"}) by (role)
`pg_wal_size_bytes`	`role`	max by(role) (pg_wal_size_bytes{role="replica"})

Buffer hits percentage by table

Metric Name	Labels	PromQL Expression
`pg_statio_user_tables_heap_blocks_hit_total`	`role`	max(100 * ( rate(pg_statio_user_tables_heap_blocks_hit_total{role="primary"}[1m]) / ( rate(pg_statio_user_tables_heap_blocks_hit_total{role="primary"}[1m]) + rate(pg_statio_user_tables_heap_blocks_read_total{role="primary"}[1m]) ) )) by (relname)
`pg_statio_user_tables_heap_blocks_read_total`	`role`	max(100 * ( rate(pg_statio_user_tables_heap_blocks_hit_total{role="primary"}[1m]) / ( rate(pg_statio_user_tables_heap_blocks_hit_total{role="primary"}[1m]) + rate(pg_statio_user_tables_heap_blocks_read_total{role="primary"}[1m]) ) )) by (relname)
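
The hit percentage in this expression is the share of block requests served from the buffer cache. A minimal sketch of that arithmetic follows, with a hypothetical function name and made-up rates.

```python
import math


def buffer_hit_percent(hit_rate: float, read_rate: float) -> float:
    """100 * hits / (hits + reads), using per-second rates for one table."""
    total = hit_rate + read_rate
    if total == 0:
        return math.nan  # PromQL also yields NaN for 0/0
    return 100 * hit_rate / total


print(buffer_hit_percent(990.0, 10.0))  # 99.0
```

A table with no block activity in the window produces NaN rather than 0, so an idle table is not reported as having a poor hit rate.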

Disk block reads by table

Metric Name	Labels	PromQL Expression
`pg_statio_user_tables_heap_blocks_read_total`	`role`	avg(rate(pg_statio_user_tables_heap_blocks_read_total{role="primary"}[1m])) by(relname)

Front Envoy Operational Status

Dedicated operational dashboard for the Front Envoy proxy in the TSB management plane. Surfaces traffic, latency, flow control, and connection metrics to reduce MTTR during incidents.

Success Rate by Upstream Cluster

Rate of successful requests through Front Envoy, broken down by upstream cluster. Covers all inbound gRPC and REST traffic entering the TSB management plane.

Metric Name	Labels	PromQL Expression
`envoy_cluster_internal_upstream_rq_total`	`component` `envoy_response_code`	sum(rate(envoy_cluster_internal_upstream_rq_total{envoy_response_code=~"2.|3.|401", component="front-envoy"}[1m])) by (envoy_cluster_name)

Error Rate by Upstream Cluster

Rate of failed requests through Front Envoy (non-2xx, non-3xx, non-401 response codes), broken down by upstream cluster and response code.

Metric Name	Labels	PromQL Expression
`envoy_cluster_internal_upstream_rq_total`	`component` `envoy_response_code`	sum(rate(envoy_cluster_internal_upstream_rq_total{envoy_response_code!~"2.|3.|401", component="front-envoy"}[1m])) by (envoy_cluster_name, envoy_response_code)

Request Latency by Upstream Cluster

Front Envoy upstream request latency percentiles (P99, P95, P50), broken down by upstream cluster. Latency spikes here indicate downstream or upstream congestion.

Metric Name	Labels	PromQL Expression
`envoy_cluster_internal_upstream_rq_time_bucket`	`component`	histogram_quantile(0.99, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[1m])) by (le, envoy_cluster_name))
`envoy_cluster_internal_upstream_rq_time_bucket`	`component`	histogram_quantile(0.95, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[1m])) by (le, envoy_cluster_name))
`envoy_cluster_internal_upstream_rq_time_bucket`	`component`	histogram_quantile(0.50, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[1m])) by (le, envoy_cluster_name))

Number of Downstream flow-control paused events that have NOT yet resumed reading

Shows the number of downstream connections (from Edge, Central, and MPC) or streams for which Envoy has paused reading due to flow control or backpressure and has not yet resumed reading. A value greater than 0 for an extended period indicates congestion.

Metric Name	Labels	PromQL Expression
`envoy_http_downstream_flow_control_paused_reading_total`	`component` `envoy_http_conn_manager_prefix`	sum(envoy_http_downstream_flow_control_paused_reading_total{component="front-envoy", envoy_http_conn_manager_prefix="tsb"}-envoy_http_downstream_flow_control_resumed_reading_total{component="front-envoy", envoy_http_conn_manager_prefix="tsb"}) by (envoy_http_conn_manager_prefix)
`envoy_http_downstream_flow_control_resumed_reading_total`	`component` `envoy_http_conn_manager_prefix`	sum(envoy_http_downstream_flow_control_paused_reading_total{component="front-envoy", envoy_http_conn_manager_prefix="tsb"}-envoy_http_downstream_flow_control_resumed_reading_total{component="front-envoy", envoy_http_conn_manager_prefix="tsb"}) by (envoy_http_conn_manager_prefix)
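
The expression subtracts the resumed counter from the paused counter to obtain the number of pauses that are still outstanding. The sketch below shows that subtraction and one possible way to flag a sustained non-zero value. The `min_consecutive` threshold is an assumption made for illustration, not a TSB or Envoy setting.

```python
def outstanding_pauses(paused_total: int, resumed_total: int) -> int:
    """Pause events that have not been followed by a resume event yet."""
    return paused_total - resumed_total


def congested(samples: list[int], min_consecutive: int = 3) -> bool:
    """True if the most recent `min_consecutive` samples are all above zero."""
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(value > 0 for value in recent)


print(outstanding_pauses(120, 118))  # 2
print(congested([0, 2, 1, 3]))  # True
print(congested([2, 0, 1, 1]))  # False
```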

Number of upstream flow-control paused events that have NOT yet resumed reading

Shows the number of upstream connections or streams for which Envoy has paused reading due to flow control or backpressure and has not yet resumed reading. A value greater than 0 for an extended period indicates congestion.

Label meanings:

tsb: HTTP requests to the TSB Bridge pod from tctl clients
tsb-grpc: gRPC requests to the TSB Bridge pod from MPC
xcp-config-status-tls: xcp-config-status requests to XCP Central from XCP Edge or MPC
xcp-http-api-tls: HTTP requests to the XCP Central diagnostics API
xcp-tls: all XCP requests to XCP Central from Edges or MPC

Metric Name	Labels	PromQL Expression
`envoy_cluster_upstream_flow_control_paused_reading_total`	`component` `envoy_cluster_name`	sum( envoy_cluster_upstream_flow_control_paused_reading_total{ component="front-envoy", envoy_cluster_name=~"tsb|tsb-grpc|xcp-config-status-tls|xcp-tls|xcp-http-api-tls" } - envoy_cluster_upstream_flow_control_resumed_reading_total{ component="front-envoy", envoy_cluster_name=~"tsb|tsb-grpc|xcp-config-status-tls|xcp-tls|xcp-http-api-tls" } ) by (envoy_cluster_name)
`envoy_cluster_upstream_flow_control_resumed_reading_total`	`component` `envoy_cluster_name`	sum( envoy_cluster_upstream_flow_control_paused_reading_total{ component="front-envoy", envoy_cluster_name=~"tsb|tsb-grpc|xcp-config-status-tls|xcp-tls|xcp-http-api-tls" } - envoy_cluster_upstream_flow_control_resumed_reading_total{ component="front-envoy", envoy_cluster_name=~"tsb|tsb-grpc|xcp-config-status-tls|xcp-tls|xcp-http-api-tls" } ) by (envoy_cluster_name)

Number of upstream flow-control backup events that have NOT yet drained

Represents how many upstream connections or streams are currently in a state where Envoy has buffered more data than it can immediately send because the upstream side is consuming data too slowly. A value greater than 0 for an extended period indicates congestion.

Label meanings:

tsb: HTTP requests to the TSB Bridge pod from tctl clients
tsb-grpc: gRPC requests to the TSB Bridge pod from MPC
xcp-config-status-tls: xcp-config-status requests to XCP Central from XCP Edge or MPC
xcp-http-api-tls: HTTP requests to the XCP Central diagnostics API
xcp-tls: all XCP requests to XCP Central from Edges or MPC

Metric Name	Labels	PromQL Expression
`envoy_cluster_upstream_flow_control_backed_up_total`	`component` `envoy_cluster_name`	sum( envoy_cluster_upstream_flow_control_backed_up_total{ component="front-envoy", envoy_cluster_name=~"tsb|tsb-grpc|xcp-config-status-tls|xcp-tls|xcp-http-api-tls" } - envoy_cluster_upstream_flow_control_drained_total{ component="front-envoy", envoy_cluster_name=~"tsb|tsb-grpc|xcp-config-status-tls|xcp-tls|xcp-http-api-tls" } ) by (envoy_cluster_name)
`envoy_cluster_upstream_flow_control_drained_total`	`component` `envoy_cluster_name`	sum( envoy_cluster_upstream_flow_control_backed_up_total{ component="front-envoy", envoy_cluster_name=~"tsb|tsb-grpc|xcp-config-status-tls|xcp-tls|xcp-http-api-tls" } - envoy_cluster_upstream_flow_control_drained_total{ component="front-envoy", envoy_cluster_name=~"tsb|tsb-grpc|xcp-config-status-tls|xcp-tls|xcp-http-api-tls" } ) by (envoy_cluster_name)

Upstream Request Timeouts and Pending Overflows

Rate of upstream request timeouts and pending-request overflows. Elevated timeouts while upstream flow control is also active point to clients waiting for responses that are queued behind flow-controlled upstream connections.

Metric Name	Labels	PromQL Expression
`envoy_cluster_upstream_rq_pending_overflow_total`	`component`	sum(rate(envoy_cluster_upstream_rq_pending_overflow_total{component="front-envoy"}[1m])) by (envoy_cluster_name)
`envoy_cluster_upstream_rq_timeout_total`	`component`	sum(rate(envoy_cluster_upstream_rq_timeout_total{component="front-envoy"}[1m])) by (envoy_cluster_name)

HTTP/2 Pending Send Bytes by Upstream Cluster

Bytes queued in HTTP/2 send buffers per upstream cluster. A large and growing value means the upstream is not draining data fast enough — a key indicator of flow control pressure.

Metric Name	Labels	PromQL Expression
`envoy_cluster_http2_pending_send_bytes`	`component`	sum(envoy_cluster_http2_pending_send_bytes{component="front-envoy"}) by (envoy_cluster_name)

Active Downstream Connections

Number of active (open) downstream connections (from Central, Edges, MPC, and other components) currently held by Front Envoy listeners. A steady increase without a corresponding traffic increase may indicate connection leaks or slow clients.

Metric Name	Labels	PromQL Expression
`envoy_listener_downstream_cx_active`	`component`	sum(envoy_listener_downstream_cx_active{component="front-envoy"}) by (job)

Downstream Connection Rate

Rate of new downstream connections accepted by Front Envoy per second.

Metric Name	Labels	PromQL Expression
`envoy_listener_downstream_cx_total`	`component`	sum(rate(envoy_listener_downstream_cx_total{component="front-envoy"}[1m])) by (job)

Active Upstream Connections by Cluster

Number of active upstream connections from Front Envoy to each backend cluster. Correlate with flow control metrics: high pending send bytes with few upstream connections may indicate connection pool saturation.

Metric Name	Labels	PromQL Expression
`envoy_cluster_upstream_cx_active`	`component`	sum(envoy_cluster_upstream_cx_active{component="front-envoy"}) by (envoy_cluster_name)

Healthy Endpoints per Cluster

Number of healthy endpoints in each upstream cluster as seen by Front Envoy. A drop to 0 means Front Envoy has no healthy host to route to and will return 503. Correlate with upstream error spikes.

Metric Name	Labels	PromQL Expression
`envoy_cluster_membership_healthy`	`component`	envoy_cluster_membership_healthy{component="front-envoy"}

Upstream Connection Failures and Timeouts

Rate of upstream connection establishment failures (connect_fail) and timeouts (connect_timeout) per cluster. Non-zero values indicate the upstream is unreachable or overloaded — Front Envoy cannot open new connections to serve requests.

Metric Name	Labels	PromQL Expression
`envoy_cluster_upstream_cx_connect_fail_total`	`component`	sum(rate(envoy_cluster_upstream_cx_connect_fail_total{component="front-envoy"}[1m])) by (envoy_cluster_name)
`envoy_cluster_upstream_cx_connect_timeout_total`	`component`	sum(rate(envoy_cluster_upstream_cx_connect_timeout_total{component="front-envoy"}[1m])) by (envoy_cluster_name)

Upstream Connections Destroyed with Active Requests

Rate of upstream connections destroyed while carrying active requests. Each event means in-flight requests were abruptly terminated — callers receive 503 or a stream reset. Sustained non-zero values indicate upstream instability.

Metric Name	Labels	PromQL Expression
`envoy_cluster_upstream_cx_destroy_with_active_rq_total`	`component`	sum(rate(envoy_cluster_upstream_cx_destroy_with_active_rq_total{component="front-envoy"}[1m])) by (envoy_cluster_name)

GitOps Operational Status

Operational metrics to indicate Cluster GitOps health

GitOps Status

Shows the status of the GitOps component for each cluster.

Metric Name	Labels	PromQL Expression
`gitops_enabled`	N/A	gitops_enabled

Accepted Admission Requests

Accepted admission requests for each cluster. This is the rate at which operations are processed by the GitOps relay and sent to TSB.

Metric Name	Labels	PromQL Expression
`gitops_admission_count_total`	`allowed`	sum(rate(gitops_admission_count_total{allowed="true"}[1h])) by (cluster_name, component)

Rejected Admission Requests

Rejected admission requests for each cluster. This is the rate at which operations are processed by the GitOps relay and sent to TSB.

A spike in these metrics may indicate an increase in invalid TSB resources being applied to the Kubernetes clusters, or errors in the admission webhook processing.

Metric Name	Labels	PromQL Expression
`gitops_admission_count_total`	`allowed`	sum(rate(gitops_admission_count_total{allowed="false"}[1h])) by (cluster_name, component)

Admission Review Latency

Admission review latency percentiles grouped by cluster.

The GitOps admission reviews make decisions by forwarding the objects to the Management Plane. This metric helps understand the time it takes to make such decisions.

A spike here may indicate network issues or connectivity issues between the Control Plane and the Management Plane.

Metric Name	Labels	PromQL Expression
`gitops_admission_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(gitops_admission_duration_bucket[1h])) by (cluster_name, component, le))
`gitops_admission_duration_bucket`	N/A	histogram_quantile(0.95, sum(rate(gitops_admission_duration_bucket[1h])) by (cluster_name, component, le))

Resources Pushed to TSB

Number of resources pushed to the Management Plane.

This should be equivalent to the admission requests in most cases, but this will also account for object pushes that are done by the background reconcile processes.

Metric Name	Labels	PromQL Expression
`gitops_push_count_total`	`success`	sum(rate(gitops_push_count_total{success="true"}[1h])) by (cluster_name, component)

Failed pushes to TSB

Number of resource pushes to the Management Plane that failed.

This should be equivalent to the failed admission requests in most cases, but this will also account for object pushes that are done by the background reconciliation processes.

Metric Name	Labels	PromQL Expression
`gitops_push_count_total`	`code`	sum(rate(gitops_push_count_total{code!="OK"}[1h])) by (cluster_name, component, code)

Resources Conversions

Number of Kubernetes resources that have been read from the cluster and successfully converted into TSB objects to be pushed to the Management plane.

The values for this metric should match those in the Resources Pushed to TSB panel. A difference between them probably indicates a problem converting the Kubernetes objects to TSB objects.

Metric Name	Labels	PromQL Expression
`gitops_convert_count_total`	`success`	sum(rate(gitops_convert_count_total{success="true"}[1h])) by (cluster_name, component)

Resources conversions errors

Number of Kubernetes resources that have been read from the cluster and failed to be converted into TSB objects.

A spike in this metric indicates that the Kubernetes objects could not be converted to TSB objects and that those resources were not sent to the Management Plane.

Metric Name	Labels	PromQL Expression
`gitops_convert_count_total`	`success`	sum(rate(gitops_convert_count_total{success="false"}[1h])) by (cluster_name, component)

Global Configuration Distribution

These metrics indicate the overall health of Tetrate Service Bridge and should be considered the starting point for any investigation into issues with Tetrate Service Bridge.

Connected Clusters

This details all clusters connected to and receiving configuration from the management plane.

If this number drops below 1, or a given cluster does not appear in this table, the cluster is disconnected. This may happen for a brief period of time during upgrades or re-deploys.

Metric Name	Labels	PromQL Expression
`xcp_central_current_edge_connections`	N/A	xcp_central_current_edge_connections

TSB Error Rate (Humans)

Rate of failed requests to the TSB apiserver from the UI and CLI.

Metric Name	Labels	PromQL Expression
`grpc_server_handled_total`	`component` `grpc_code` `grpc_method` `grpc_type`	sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_code) OR on() vector(0)

Istio-Envoy Sync Time (P99)

Once XCP has synced with the management plane it creates resources for Istio to configure Envoy. Istio usually distributes these within a second.

If this number starts to exceed 10 seconds, you may need to scale out istiod. In small clusters, this value may be too small for the histogram buckets to resolve, in which case it may be nil.

Metric Name	Labels	PromQL Expression
`pilot_proxy_convergence_time_bucket`	N/A	histogram_quantile(99/100, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le, cluster_name))
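
To make it easier to reason about how a P99 value is estimated from bucket counters, here is a simplified sketch of the linear interpolation that PromQL's `histogram_quantile` applies to classic histogram buckets. It is an approximation for illustration, not the Prometheus implementation, and the input format (a list of `(upper_bound, cumulative_count)` pairs) is an assumption made for this example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
             upper_bound and ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        # No observations in the window: there is nothing to estimate.
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

Two consequences follow from this sketch: the result is never more precise than the bucket boundaries allow, and a window without observations yields no value at all.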

XCP central -> edge Sync Time (P99)

The MPC component translates TSB configuration into XCP objects. XCP central then sends these objects to every Edge connected to it.

This is the time taken for XCP central to send the configs to edges in ms.

Metric Name	Labels	PromQL Expression
`xcp_central_config_propagation_time_ms_bucket`	N/A	histogram_quantile(99/100, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge))

Istiod Errors

Rate of istiod errors broken down by cluster. This graph helps identify clusters that may be experiencing problems. Typically, there should be no errors. Any non-transient errors should be investigated.

Sometimes this graph will show "No data" or these metrics won't exist. This is because istiod only emits these metrics if the errors occur.

Metric Name	Labels	PromQL Expression
`pilot_total_xds_internal_errors`	N/A	sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
`pilot_total_xds_rejects`	N/A	sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
`pilot_xds_expired_nonce`	N/A	sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
`pilot_xds_push_context_errors`	N/A	sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
`pilot_xds_pushes`	`type`	sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
`pilot_xds_write_timeout`	N/A	sum(rate(pilot_xds_write_timeout[1m])) by (cluster_name) + sum(rate(pilot_total_xds_internal_errors[1m])) by (cluster_name) + sum(rate(pilot_total_xds_rejects[1m])) by (cluster_name) + sum(rate(pilot_xds_expired_nonce[1m])) by (cluster_name) + sum(rate(pilot_xds_push_context_errors[1m])) by (cluster_name) + sum(rate(pilot_xds_pushes{type=~".*_senderr"}[1m])) by (cluster_name) OR on() vector(0)
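
The "No data" behaviour follows from how the expression is combined. As we understand PromQL's one-to-one vector matching, `+` keeps only the label sets that are present on both sides, so a cluster drops out of the sum as soon as one of the error metrics has no series for it, and `OR on() vector(0)` supplies a single zero only when the whole sum is empty. The sketch below imitates that behaviour with plain dictionaries keyed by `cluster_name`; the function names are hypothetical.

```python
def add_by_cluster(*vectors):
    """Add instant vectors keyed by cluster_name.

    Only clusters present in every operand survive, which mirrors
    one-to-one matching of the `+` operator between instant vectors.
    """
    if not vectors:
        return {}
    common = set(vectors[0])
    for vector in vectors[1:]:
        common &= set(vector)
    return {cluster: sum(v[cluster] for v in vectors) for cluster in sorted(common)}


def or_zero(vector):
    """Imitate `OR on() vector(0)`: fall back to one unlabeled 0 sample when empty."""
    return vector if vector else {"": 0.0}
```

For example, if only `pilot_xds_write_timeout` has series and the other operands are empty, `add_by_cluster` returns an empty result and `or_zero` turns it into a single zero sample without a cluster label.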

1. Time for config to be visible by XCP Central (P99)

Time it takes for the configuration to be seen by XCP Central in the Management Plane.

This includes:

Time it takes for MPC to pull configs from TSB.
Time it takes to process the configurations.
Time it takes to apply the delta to the "tsb" namespace so that configurations are visible to XCP Central.

After this time, XCP Central sees the configurations and distributes them to the workload clusters.

Metric Name	Labels	PromQL Expression
`mpc_config_total_propagation_duration_bucket`	N/A	histogram_quantile(99/100, sum(rate(mpc_config_total_propagation_duration_bucket[1m])) by (le))

2. Time it takes to send the configs from MP to the workload clusters (P99)

Once XCP Central sees the configurations in the "tsb" namespace, it sends the configs to the workload clusters.

This panel shows the time it takes (P99) to send the configs to each cluster.

Metric Name	Labels	PromQL Expression
`xcp_central_config_propagation_time_ms_bucket`	N/A	histogram_quantile(99/100, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge))

3. Istio Config generation time (P99)

The time it takes for the TSB agents in the workload cluster to create the Istio configurations for all the namespaces.

Metric Name	Labels	PromQL Expression
`xcp_edge_total_translation_time_in_ms_bucket`	N/A	histogram_quantile(99/100,sum(rate(xcp_edge_total_translation_time_in_ms_bucket[1m])) by (le, cluster_name))

4. Proxy convergence time (P99)

Time it takes for Istio to distribute the configuration to the Envoy proxies.

If this number starts to exceed 10 seconds, you may need to scale out istiod. In small clusters, this value may be too small for the histogram buckets to resolve, in which case it may be nil.

Metric Name	Labels	PromQL Expression
`pilot_proxy_convergence_time_bucket`	N/A	histogram_quantile(99/100, sum(rate(pilot_proxy_convergence_time_bucket[1m])) by (le, cluster_name))
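
When comparing the four numbered panels above, it helps to bring the values to a common unit first. The sketch below is an illustration under stated assumptions: metrics whose names contain `_ms` are taken to be in milliseconds, the remaining two are assumed to be in seconds, and the P99 values are assumed to have been read from the panels by hand. The function and constant names are hypothetical.

```python
# Metrics assumed to report milliseconds, based only on their names.
MILLISECOND_METRICS = {
    "xcp_central_config_propagation_time_ms",
    "xcp_edge_total_translation_time_in_ms",
}


def slowest_stage(p99_by_metric):
    """Return (metric_name, seconds) for the stage with the largest P99.

    p99_by_metric: mapping of metric name to the P99 value shown in its panel.
    """
    in_seconds = {
        name: value / 1000.0 if name in MILLISECOND_METRICS else value
        for name, value in p99_by_metric.items()
    }
    worst = max(in_seconds, key=in_seconds.get)
    return worst, in_seconds[worst]
```

The stage returned is only a starting point for investigation, since P99 values from different stages are not additive.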

Istiod / Pilot Control Plane (MP)

Connected Proxies

pilot_xds — total endpoints connected via xDS

Metric Name	Labels	PromQL Expression
`pilot_xds`	`cluster_name` `istio_revision` `version`	sum(pilot_xds{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision", version!="65536.65536.65536"})

Total Error Rate

Metric Name	Labels	PromQL Expression
`pilot_sds_certificate_errors_total`	`cluster_name` `istio_revision`	sum(rate(pilot_xds_rejects_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) + sum(rate(pilot_xds_expired_nonce_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) + sum(rate(pilot_sds_certificate_errors_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) or on() vector(0)
`pilot_xds_expired_nonce_total`	`cluster_name` `istio_revision`	sum(rate(pilot_xds_rejects_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) + sum(rate(pilot_xds_expired_nonce_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) + sum(rate(pilot_sds_certificate_errors_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) or on() vector(0)
`pilot_xds_rejects_total`	`cluster_name` `istio_revision`	sum(rate(pilot_xds_rejects_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) + sum(rate(pilot_xds_expired_nonce_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) + sum(rate(pilot_sds_certificate_errors_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) or on() vector(0)

Median Proxy Convergence Time

Metric Name	Labels	PromQL Expression
`pilot_proxy_convergence_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.5, sum(rate(pilot_proxy_convergence_time_bucket{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m])) by (le))

Services Known

pilot_services — total services in the mesh

Metric Name	Labels	PromQL Expression
`pilot_services`	`cluster_name` `istio_revision`	max(pilot_services{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})

VirtualServices

pilot_virt_services — total VS known to pilot

Metric Name	Labels	PromQL Expression
`pilot_virt_services`	`cluster_name` `istio_revision`	max(pilot_virt_services{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})

Build & Version Info

istio_build + pilot_info + pilot_xds by version

Metric Name	Labels	PromQL Expression
`istio_build`	`cluster_name` `istio_revision`	istio_build{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}

Istiod Pods

Istiod pod instances connected for the selected revision

Metric Name	Labels	PromQL Expression
`istiod_uptime_seconds`	`cluster_name` `istio_revision`	istiod_uptime_seconds{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}

Connected Proxies by Instance

pilot_xds by proxy version over time. Useful during upgrades to see old vs new proxy versions.

Metric Name	Labels	PromQL Expression
`pilot_xds`	`cluster_name` `istio_revision` `version`	sum by (instance)(pilot_xds{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision", version!="65536.65536.65536"})

Istiod CPU Usage (cores)

CPU cores consumed by each istiod process (rate of process_cpu_seconds_total)

Metric Name	Labels	PromQL Expression
`process_cpu_seconds_total`	`cluster_name` `component` `istio_revision`	rate(process_cpu_seconds_total{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision", component="istiod"}[1m])

Istiod Memory Usage

Resident memory (RSS) of each istiod process

Metric Name	Labels	PromQL Expression
`go_memstats_heap_inuse_bytes`	`cluster_name` `component` `istio_revision`	go_memstats_heap_inuse_bytes{cluster_name="$cluster_name", istio_revision="$istio_revision", component="istiod"}
`process_resident_memory_bytes`	`cluster_name` `component` `istio_revision`	process_resident_memory_bytes{cluster_name="$cluster_name", istio_revision="$istio_revision", component="istiod"}

Root Cert Expires In

citadel_server_root_cert_expiry_seconds — time remaining before root cert expires. NEGATIVE = EXPIRED!

Metric Name	Labels	PromQL Expression
`citadel_server_root_cert_expiry_seconds`	`cluster_name` `istio_revision`	min(citadel_server_root_cert_expiry_seconds{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
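
The panel value is easier to act on when converted to days. The following sketch assumes the value is the remaining lifetime in seconds, as described above; the 30-day warning threshold and the function name are arbitrary choices for this example, not TSB or Istio defaults.

```python
SECONDS_PER_DAY = 86400


def describe_root_cert(remaining_seconds, warn_days=30):
    """Classify the root certificate by its remaining lifetime.

    remaining_seconds: value of the panel expression (negative means expired).
    warn_days: hypothetical threshold below which rotation should be planned.
    """
    if remaining_seconds < 0:
        return "expired"
    if remaining_seconds / SECONDS_PER_DAY < warn_days:
        return "expiring soon"
    return "ok"
```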

CSR Rate & Cert Issuance Rate

citadel_server_csr_count — CSRs received. citadel_server_success_cert_issuance_count — successful issuances. Gap = failures.

Metric Name	Labels	PromQL Expression
`citadel_server_csr_count_total`	`cluster_name` `istio_revision`	sum(rate(citadel_server_csr_count_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))
`citadel_server_success_cert_issuance_count_total`	`cluster_name` `istio_revision`	sum(rate(citadel_server_success_cert_issuance_count_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))
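
The gap between the two series can be made explicit. This is a minimal sketch, assuming both per-second rates have been read from the panel for the same time window; the function name `failure_gap` is hypothetical.

```python
def failure_gap(request_rate, success_rate):
    """Return (failures_per_second, failure_percentage) from two rates.

    request_rate: rate of CSRs received.
    success_rate: rate of successful certificate issuances.
    """
    # The two rates are computed independently, so the difference is
    # clamped at zero to avoid reporting a negative failure rate.
    failures = max(request_rate - success_rate, 0.0)
    percentage = 100.0 * failures / request_rate if request_rate > 0 else 0.0
    return failures, percentage
```

The same calculation applies to any panel on this page that compares a request counter with a success counter.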

SDS Certificate Errors

pilot_sds_certificate_errors_total — failures fetching SDS key/cert. Non-zero = mTLS issues.

Metric Name	Labels	PromQL Expression
`pilot_sds_certificate_errors_total`	`cluster_name` `istio_revision`	sum(rate(pilot_sds_certificate_errors_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))
`pilot_sds_certificate_errors_total`	`cluster_name` `istio_revision`	sum(pilot_sds_certificate_errors_total{cluster_name="$cluster_name", istio_revision="$istio_revision"})

Sidecar Injection Rate

sidecar_injection_requests_total vs sidecar_injection_success_total — gap = failed injections

Metric Name	Labels	PromQL Expression
`sidecar_injection_requests_total`	`cluster_name` `istio_revision`	sum(rate(sidecar_injection_requests_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))
`sidecar_injection_success_total`	`cluster_name` `istio_revision`	sum(rate(sidecar_injection_success_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))

Injection Latency (p50 / p95 / p99)

sidecar_injection_time_seconds — time taken for sidecar injection. High = slow pod starts.

Metric Name	Labels	PromQL Expression
`sidecar_injection_time_seconds_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.50, sum by (le) (rate(sidecar_injection_time_seconds_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`sidecar_injection_time_seconds_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.95, sum by (le) (rate(sidecar_injection_time_seconds_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`sidecar_injection_time_seconds_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.99, sum by (le) (rate(sidecar_injection_time_seconds_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))

Validation & Webhook

galley_validation_passed — validated resources. webhook_patch_failures_total — webhook patch failures by reason.

Metric Name	Labels	PromQL Expression
`galley_validation_passed`	`cluster_name` `istio_revision`	sum by (group, resource) (rate(galley_validation_passed{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))
`webhook_patch_failures_total`	`cluster_name` `istio_revision`	sum by (name, reason) (rate(webhook_patch_failures_total{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))

Endpoint Health

All the endpoint-related health gauges. Non-zero values on any of these = investigate.

Metric Name	Labels	PromQL Expression
`endpoint_no_pod`	`cluster_name` `istio_revision`	sum(endpoint_no_pod{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_eds_no_instances`	`cluster_name` `istio_revision`	sum(pilot_eds_no_instances{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_endpoint_not_ready`	`cluster_name` `istio_revision`	sum(pilot_endpoint_not_ready{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_k8s_endpoints_pending_pod`	`cluster_name` `istio_revision`	sum(pilot_k8s_endpoints_pending_pod{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_no_ip`	`cluster_name` `istio_revision`	sum(pilot_no_ip{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})

Config Conflicts

All conflict gauges. Non-zero = config issues that may cause unexpected routing. pilot_conflict_inbound_listener, pilot_conflict_outbound_listener_tcp_over_current_tcp, pilot_destrule_subsets, pilot_duplicate_envoy_clusters, pilot_dns_cluster_without_endpoints, pilot_vservice_dup_domain.

Metric Name	Labels	PromQL Expression
`pilot_conflict_inbound_listener`	`cluster_name` `istio_revision`	sum(pilot_conflict_inbound_listener{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_conflict_outbound_listener_tcp_over_current_tcp`	`cluster_name` `istio_revision`	sum(pilot_conflict_outbound_listener_tcp_over_current_tcp{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_destrule_subsets`	`cluster_name` `istio_revision`	sum(pilot_destrule_subsets{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_dns_cluster_without_endpoints`	`cluster_name` `istio_revision`	sum(pilot_dns_cluster_without_endpoints{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_duplicate_envoy_clusters`	`cluster_name` `istio_revision`	sum(pilot_duplicate_envoy_clusters{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})
`pilot_vservice_dup_domain`	`cluster_name` `istio_revision`	sum(pilot_vservice_dup_domain{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})

Applied Envoy Filters

Number of successfully applied Envoy Filters in the cluster.

Metric Name	Labels	PromQL Expression
`pilot_envoy_filter_status`	`cluster_name` `istio_revision` `result`	sum(pilot_envoy_filter_status{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision", result="applied"})

Errored Envoy Filters

Number of Envoy Filters in the cluster that failed to apply.

Metric Name	Labels	PromQL Expression
`pilot_envoy_filter_status`	`cluster_name` `istio_revision` `result`	sum(pilot_envoy_filter_status{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision", result="error"})

Envoy Filter status

Status of each Envoy Filter in the cluster (applied/error).

Metric Name	Labels	PromQL Expression
`pilot_envoy_filter_status`	`cluster_name` `istio_revision`	pilot_envoy_filter_status{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}

Config Events (pilot_k8s_cfg_events)

Istio config resource events: add/update/delete by type (VirtualService, DestinationRule, Gateway, EnvoyFilter, ServiceEntry, etc). Spikes = someone is pushing config.

Metric Name	Labels	PromQL Expression
`pilot_k8s_cfg_events_total`	`cluster_name` `istio_revision`	sum by (type, event) (rate(pilot_k8s_cfg_events_total{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m]))

Registry Events (pilot_k8s_reg_events)

K8s registry events: Pods, Services, EndpointSlice, Nodes, Namespaces. Spikes = deployments, scaling, node changes.

Metric Name	Labels	PromQL Expression
`pilot_k8s_reg_events_total`	`cluster_name` `istio_revision`	sum by (type, event) (rate(pilot_k8s_reg_events_total{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m]))

Inbound Updates (pilot_inbound_updates)

Updates received by pilot: config, eds, svc, svcdelete. This is the aggregate trigger input before debouncing.

Metric Name	Labels	PromQL Expression
`pilot_inbound_updates_total`	`cluster_name` `istio_revision`	sum by (type) (rate(pilot_inbound_updates_total{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m]))

Push Triggers by Reason

pilot_push_triggers — what is causing pushes: config, endpoint, service, proxy, secret, networks. 'config' = Istio CR changes. 'endpoint' = pod scale events. 'secret' = cert rotations.

Metric Name	Labels	PromQL Expression
`pilot_push_triggers_total`	`cluster_name` `istio_revision`	sum by (type) (rate(pilot_push_triggers_total{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m]))

Debounce Time (p50 / p95 / p99)

pilot_debounce_time — delay between first config event and the merged push entering the queue. High = many events being batched (normal under churn). Very high = too much config churn.

Metric Name	Labels	PromQL Expression
`pilot_debounce_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.50, sum by (le) (rate(pilot_debounce_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_debounce_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.95, sum by (le) (rate(pilot_debounce_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_debounce_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.99, sum by (le) (rate(pilot_debounce_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))

PushContext Init Time (p50 / p95 / p99)

pilot_pushcontext_init_seconds — time to build the internal push context (service index, config snapshot). High = large mesh with many services/configs.

Metric Name	Labels	PromQL Expression
`pilot_pushcontext_init_seconds_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.50, sum by (le) (rate(pilot_pushcontext_init_seconds_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_pushcontext_init_seconds_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.95, sum by (le) (rate(pilot_pushcontext_init_seconds_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_pushcontext_init_seconds_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.99, sum by (le) (rate(pilot_pushcontext_init_seconds_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))

Proxy Queue Time (p50 / p95 / p99)

pilot_proxy_queue_time — time a proxy sits in the push queue before being processed. High = istiod overloaded, can't keep up with push rate.

Metric Name	Labels	PromQL Expression
`pilot_proxy_queue_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.50, sum by (le) (rate(pilot_proxy_queue_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_proxy_queue_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.95, sum by (le) (rate(pilot_proxy_queue_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_proxy_queue_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.99, sum by (le) (rate(pilot_proxy_queue_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))

xDS Push Time by Type (p99)

pilot_xds_push_time — time to generate xDS config per type (CDS, LDS, RDS, EDS, SDS, ECDS, NDS). CDS/RDS usually slowest in large meshes.

Metric Name	Labels	PromQL Expression
`pilot_xds_push_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.99, sum by (type, le) (rate(pilot_xds_push_time_bucket{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m])))

xDS Send Time (p50 / p95 / p99)

pilot_xds_send_time — time to serialize and send generated config to proxy over gRPC. High = large configs or network issues.

Metric Name	Labels	PromQL Expression
`pilot_xds_send_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.50, sum by (le) (rate(pilot_xds_send_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_xds_send_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.95, sum by (le) (rate(pilot_xds_send_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_xds_send_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.99, sum by (le) (rate(pilot_xds_send_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))

End-to-End Proxy Convergence Time (p50 / p95 / p99)

pilot_proxy_convergence_time — THE key metric. Total delay from config change to proxy having the config. This is the sum of all pipeline stages. Target: < 1s for most meshes, < 5s for large.

Metric Name	Labels	PromQL Expression
`pilot_proxy_convergence_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.50, sum by (le) (rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_proxy_convergence_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.95, sum by (le) (rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))
`pilot_proxy_convergence_time_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.99, sum by (le) (rate(pilot_proxy_convergence_time_bucket{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])))

xDS Pushes by Type

pilot_xds_pushes — push count per xDS type (CDS, LDS, RDS, EDS, SDS, ECDS, NDS). Rate shows how active the control plane is.

Metric Name	Labels	PromQL Expression
`pilot_xds_pushes_total`	`cluster_name` `istio_revision`	sum by (type) (rate(pilot_xds_pushes_total{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m]))

Config Size Pushed by Type (avg bytes)

pilot_xds_config_size_bytes — average size of config pushed per xDS type. CDS/RDS are typically the largest. Large config = slow pushes, high proxy memory.

Metric Name	Labels	PromQL Expression
`pilot_xds_config_size_bytes_count`	`cluster_name` `istio_revision`	sum by (type) (rate(pilot_xds_config_size_bytes_sum{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) / sum by (type) (rate(pilot_xds_config_size_bytes_count{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))
`pilot_xds_config_size_bytes_sum`	`cluster_name` `istio_revision`	sum by (type) (rate(pilot_xds_config_size_bytes_sum{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m])) / sum by (type) (rate(pilot_xds_config_size_bytes_count{cluster_name="$cluster_name", istio_revision="$istio_revision"}[1m]))
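
The average in this panel is the rate of the histogram's `_sum` series divided by the rate of its `_count` series. Because both rates cover the same window, the window length cancels out and the result is simply bytes per push. The sketch below illustrates this with two hypothetical scrapes of each counter; the function name and sample values are made up, and counter resets are not handled.

```python
import math


def average_config_size(sum_start, sum_end, count_start, count_end):
    """Average bytes per push between two scrapes of the _sum and _count series.

    Returns NaN when no push happened in the window, matching the
    division by zero that the PromQL expression would produce.
    """
    pushes = count_end - count_start
    if pushes <= 0:
        return math.nan
    return (sum_end - sum_start) / pushes
```

For instance, if `_sum` grows by 4000 bytes while `_count` grows by 8 pushes, the average is 500 bytes per push regardless of how long the window was.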

Expired Nonces by Type

pilot_xds_expired_nonce — proxy sent a request with an outdated nonce (it was too slow to ACK before the next push). High = proxy can't keep up with config churn.

Metric Name	Labels	PromQL Expression
`pilot_xds_expired_nonce_total`	`cluster_name` `istio_revision`	sum by (type) (rate(pilot_xds_expired_nonce_total{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m]))

Config Size by Type (p99 bytes)

pilot_xds_config_size_bytes as p99 histogram — worst-case config size per push type.

Metric Name	Labels	PromQL Expression
`pilot_xds_config_size_bytes_bucket`	`cluster_name` `istio_revision`	histogram_quantile(0.99, sum by (type, le) (rate(pilot_xds_config_size_bytes_bucket{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m])))

Largest Request Received

pilot_xds_recv_max — max size of an xDS request received from any proxy. Very large = proxy sending huge ACKs or NACKs.

Metric Name	Labels	PromQL Expression
`pilot_xds_recv_max`	`cluster_name` `istio_revision`	max(pilot_xds_recv_max{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"})

Total Config Bytes Pushed / sec

Total data rate of config being pushed across all types. High = istiod spending significant bandwidth on xDS.

Metric Name	Labels	PromQL Expression
`pilot_xds_config_size_bytes_sum`	`cluster_name` `istio_revision`	sum by (type) (rate(pilot_xds_config_size_bytes_sum{cluster_name=~"$cluster_name", istio_revision=~"$istio_revision"}[1m]))

MPC Operational Status

Operational metrics to indicate Management Plane Controller (MPC) health.

Propagate Config Objects duration

How long it takes to propagate all the configs from TSB to XCP, measured from the moment MPC retrieves the configs until they are applied in the k8s namespace. It is composed of:

Reception time: how long it takes for the GetAllConfigObjects gRPC method to retrieve all configuration objects.
Conversion time: how long it takes to compute the model cache, execute the conversions and apply them to the k8s namespace.

These metrics are distributed into buckets that provide seconds of accuracy. The square steps displayed on the dashboard represent fluctuations within the next bucket.

Metric Name	Labels	PromQL Expression
`mpc_config_total_process_duration_bucket`	`component` `error`	histogram_quantile(0.99, sum(rate(mpc_config_total_process_duration_bucket{error="", component="mpc"}[5m])) by (le,component))
`mpc_config_total_process_duration_bucket`	`component` `error`	histogram_quantile(0.99, sum(rate(mpc_config_total_process_duration_bucket{error="true", component="mpc"}[5m])) by (le,component))
`mpc_config_total_propagation_duration_bucket`	`component`	histogram_quantile(0.99, sum(rate(mpc_config_total_propagation_duration_bucket{component="mpc"}[5m])) by (le,component))
`mpc_get_all_config_objects_duration_bucket`	`component` `error`	histogram_quantile(0.99, sum(rate(mpc_get_all_config_objects_duration_bucket{error="", component="mpc"}[5m])) by (le,component))
`mpc_get_all_config_objects_duration_bucket`	`component` `error`	histogram_quantile(0.99, sum(rate(mpc_get_all_config_objects_duration_bucket{error="true", component="mpc"}[5m])) by (le,component))

Received configs

The number of resources sent from TSB to MPC.

This metric shows the number of objects that are created, updated, and deleted as part of a configuration push from MPC to XCP.

This metric can be used together with the XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name	Labels	PromQL Expression
`mpc_tsb_config_received_count`	`component` `resource`	max(max_over_time(mpc_tsb_config_received_count{resource="", component="mpc"}[5m])) by(component)

Config Processing duration

Time it takes to process an entire config set. It shows the details about the amount of time spent pre-processing the configurations, converting them to XCP, and pushing them to the k8s cluster.

These metrics are distributed into buckets that provide seconds of accuracy. The square steps displayed on the dashboard represent fluctuations between adjacent buckets.

Metric Name	Labels	PromQL Expression
`mpc_config_conversion_duration_bucket`	`component` `error`	histogram_quantile(0.99, sum(rate(mpc_config_conversion_duration_bucket{error="", component="mpc"}[5m])) by (le,component))
`mpc_config_pre_process_duration_bucket`	`component` `error`	histogram_quantile(0.99, sum(rate(mpc_config_pre_process_duration_bucket{error="", component="mpc"}[5m])) by (le,component))
`mpc_config_total_process_duration_bucket`	`component` `error`	histogram_quantile(0.99, sum(rate(mpc_config_total_process_duration_bucket{error="", component="mpc"}[5m])) by (le,component))
`mpc_xcp_config_push_duration_bucket`	`component` `error`	histogram_quantile(0.99, sum(rate(mpc_xcp_config_push_duration_bucket{error="", component="mpc"}[5m])) by (le,component))

Received configs by type

Configuration updates received from TSB are processed by MPC and translated into XCP resources. This metric shows the number of objects of each type MPC will convert.

Metric Name	Labels	PromQL Expression
`mpc_tsb_config_received_count`	`component` `resource`	sum(max_over_time(mpc_tsb_config_received_count{resource!="", component="mpc"}[5m])) by(component, resource)

Total Conversion Time by Type every 5m

Time it takes to convert TSB resources to the XCP APIs.

Metric Name	Labels	PromQL Expression
`mpc_xcp_conversion_duration_sum`	`component`	sum(rate(mpc_xcp_conversion_duration_sum{component="mpc"}[5m])) by (resource)

Conversion Time by Type every 5m

Time it takes to convert TSB resources to the XCP APIs.

Metric Name	Labels	PromQL Expression
`mpc_xcp_conversion_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(mpc_xcp_conversion_duration_bucket[5m])) by (le, resource))

Conversions by Resource every 5m

Conversions by resource executed in a time period. This can be used to understand the throughput of the MPC conversions.

Metric Name	Labels	PromQL Expression
`mpc_xcp_conversion_duration_count`	`component` `resource`	sum(increase(mpc_xcp_conversion_duration_count{resource!="", component="mpc"}[5m])) by (resource)
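
The `increase(...[5m])` call in this expression reports how much a counter grew during the window. The sketch below illustrates the core idea, including how a counter reset (for example after an MPC restart) is handled. It is a simplification: Prometheus additionally extrapolates to the window boundaries, which is ignored here, and the function name is hypothetical.

```python
def counter_increase(samples):
    """Total growth across ordered counter samples within one window.

    A sample lower than its predecessor is treated as a counter reset,
    so the new value itself is counted as growth since the reset.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current
    return total
```

With the samples 10, 15, 3, 8 the counter grew by 5, was reset and reached 3, and then grew by another 5, for a total of 13 conversions in the window.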

Conversions Invalidations by Resource every 5m

Conversion invalidations executed per resource in a time period. This can be used to understand the throughput of the MPC conversion invalidations.

Metric Name	Labels	PromQL Expression
`mpc_xcp_conversion_invalidation_duration_count`	`component` `resource`	sum(increase(mpc_xcp_conversion_invalidation_duration_count{component="mpc", resource!=""}[5m])) by (resource)

Conversion Invalidation Time every 5m

Time it takes to invalidate TSB resources for a given conversion to the XCP APIs.

Metric Name	Labels	PromQL Expression
`mpc_xcp_conversion_invalidation_duration_bucket`	`component` `resource`	histogram_quantile(0.99, sum(rate(mpc_xcp_conversion_invalidation_duration_bucket{component="mpc", resource!=""}[5m])) by (le, resource))

Updates from TSB every 5m

Configuration and onboarded cluster messages received from TSB.

The number of update messages may increase or decrease based on the time it takes for MPC to fully process the messages. The longer processing takes, the less frequently config updates are retrieved.

Metric Name	Labels	PromQL Expression
`grpc_client_handled_total`	`component` `grpc_code` `grpc_method`	sum(increase(grpc_client_handled_total{component="mpc", grpc_method="GetAllConfigObjects", grpc_code="OK"}[5m])) or on() vector(0)
`grpc_client_handled_total`	`component` `grpc_code` `grpc_method`	sum(increase(grpc_client_handled_total{component="mpc", grpc_method="GetAllClusters", grpc_code="OK"}[5m])) or on() vector(0)
`grpc_client_handled_total`	`component` `grpc_code` `grpc_method`	sum(increase(grpc_client_handled_total{component="mpc", grpc_method="GetAllConfigObjects", grpc_code!="OK"}[5m])) or on() vector(0)
`grpc_client_handled_total`	`component` `grpc_code` `grpc_method`	sum(increase(grpc_client_handled_total{component="mpc", grpc_method="GetAllClusters", grpc_code!="OK"}[5m])) or on() vector(0)
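
The expressions above end with `or on() vector(0)`. When no matching series exists in the window (for example, no failed calls), `sum()` returns an empty result and the panel would show "no data"; the fallback makes it show 0 instead. The sketch below models that behaviour for an aggregated result. The function name and the dictionary representation of a query result are assumptions made for illustration.

```python
from typing import Dict, Tuple

# A query result modelled as {label set: value}; sum() without "by"
# yields at most one series with an empty label set.
Labels = Tuple[Tuple[str, str], ...]


def or_zero(result: Dict[Labels, float]) -> Dict[Labels, float]:
    """Emulate `<aggregated expr> or on() vector(0)`."""
    # With on() every left-hand series matches the single right-hand
    # element, so vector(0) only survives when the left side is empty.
    return result if result else {(): 0.0}


print(or_zero({(): 12.0}))
print(or_zero({}))
```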

MPC to XCP pushed configs

The number of resources that are pushed to XCP.

This metric shows the number of objects that are created, updated, and deleted as part of a configuration push from MPC to XCP. It also shows how many fetch calls are made to the Kubernetes API server.

This metric can be used together with the TSB to MPC sent configs and XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name	Labels	PromQL Expression
`mpc_xcp_config_create_ops`	`component`	sum(mpc_xcp_config_create_ops{component="mpc"})
`mpc_xcp_config_delete_ops`	`component`	sum(mpc_xcp_config_delete_ops{component="mpc"})
`mpc_xcp_config_fetch_ops`	`component`	sum(mpc_xcp_config_fetch_ops{component="mpc"})
`mpc_xcp_config_update_ops`	`component`	sum(mpc_xcp_config_update_ops{component="mpc"})

MPC to XCP pushed configs error

The number of resources that failed while pushing to XCP.

This metric shows the number of objects that fail to be created, updated, or deleted as part of a configuration push from MPC to XCP. It also shows the number of failed fetch calls to the Kubernetes API server.

This metric can be used together with the MPC to TSB push configs and the XCP push operations and push duration to get an understanding of how the amount of resources being pushed to XCP affects the time it takes for the entire configuration push operation to complete.

Metric Name	Labels	PromQL Expression
`mpc_xcp_config_create_ops_err`	`component`	sum(mpc_xcp_config_create_ops_err{component="mpc"})
`mpc_xcp_config_delete_ops_err`	`component`	sum(mpc_xcp_config_delete_ops_err{component="mpc"})
`mpc_xcp_config_fetch_ops_err`	`component`	sum(mpc_xcp_config_fetch_ops_err{component="mpc"})
`mpc_xcp_config_update_ops_err`	`component`	sum(mpc_xcp_config_update_ops_err{component="mpc"})

Config Status updates every 5m

Config Status update messages sent over the gRPC streams, from XCP to MPC to TSB.

This metric can help understand how messages are queued in TSB when it is under load. The value for both metrics should always be the same. If the Received by TSB metric has a value lower than the MPC one, it means TSB is under load and cannot process all messages sent by MPC as fast as MPC is sending them.

Metric Name	Labels	PromQL Expression
`grpc_client_msg_received_total`	`component` `grpc_method`	sum(increase(grpc_client_msg_received_total{grpc_method="Report",component="mpc"}[5m])) or on() vector(0)
`grpc_client_msg_sent_total`	`component` `grpc_method`	sum(increase(grpc_client_msg_sent_total{grpc_method="PushStatus",component="mpc"}[5m])) or on() vector(0)
`grpc_server_msg_received_total`	`component` `grpc_method`	sum(increase(grpc_server_msg_received_total{grpc_method="PushStatus", component="tsb"}[5m])) or on() vector(0)
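
The comparison described for this panel can be sketched as follows: take the number of `PushStatus` messages sent by MPC and the number received by TSB over the same window, and flag the case where TSB is behind. The function names and the `tolerance` parameter are hypothetical; a small tolerance can be useful because `increase()` may return fractional values.

```python
def status_report_lag(sent_by_mpc: float, received_by_tsb: float) -> float:
    """Messages MPC sent in the window that TSB has not received yet."""
    return max(0.0, sent_by_mpc - received_by_tsb)


def tsb_is_behind(sent_by_mpc: float, received_by_tsb: float,
                  tolerance: float = 0.0) -> bool:
    """True when TSB received noticeably fewer messages than MPC sent."""
    return status_report_lag(sent_by_mpc, received_by_tsb) > tolerance


# Hypothetical 5m values read from the two series of this panel.
print(tsb_is_behind(sent_by_mpc=1200.0, received_by_tsb=1200.0))
print(tsb_is_behind(sent_by_mpc=1200.0, received_by_tsb=950.0, tolerance=1.0))
```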

Config Status updates processed every 5m

The number of config status updates processed by the Management Plane Controller (MPC), that is, updates received from XCP that are to be sent to TSB.

There are two gRPC streams, one that connects XCP to MPC and another one that connects MPC to TSB.

Metric Name	Labels	PromQL Expression
`permanent_stream_operation_total`	`component` `error` `name`	sum(increase(permanent_stream_operation_total{name="StatusPush", error="", component="mpc"}[5m])) or on() vector(0)
`permanent_stream_operation_total`	`component` `error` `name`	sum(increase(permanent_stream_operation_total{name="StatusPull", error="", component="mpc"}[5m])) or on() vector(0)
`permanent_stream_operation_total`	`component` `error` `name`	sum(increase(permanent_stream_operation_total{name="StatusPush", error!="", component="mpc"}[5m])) or on() vector(0)
`permanent_stream_operation_total`	`component` `error` `name`	sum(increase(permanent_stream_operation_total{name="StatusPull", error!="", component="mpc"}[5m])) or on() vector(0)

Config Status stream connection attempts every 5m

The number of connection (and reconnection) attempts on the config status updates streams. MPC sends the config status updates over a permanently connected gRPC stream to TSB. At the same time, XCP sends them to MPC. This metric shows the number of connections and reconnections that happened on each stream.

Metric Name	Labels	PromQL Expression
`permanent_stream_connection_attempts_total`	`error` `name`	sum(increase(permanent_stream_connection_attempts_total{name="StatusPull", error="" }[5m])) or on() vector(0)
`permanent_stream_connection_attempts_total`	`error` `name`	sum(increase(permanent_stream_connection_attempts_total{name="StatusPull", error!="" }[5m])) or on() vector(0)
`permanent_stream_connection_attempts_total`	`error` `name`	sum(increase(permanent_stream_connection_attempts_total{name="StatusPush", error="" }[5m])) or on() vector(0)
`permanent_stream_connection_attempts_total`	`error` `name`	sum(increase(permanent_stream_connection_attempts_total{name="StatusPush", error!="" }[5m])) or on() vector(0)

TSB Handled Status Reports 5m

Number of config status reports handled by TSB. Each received config status report is either handled or skipped.

Handled: The status report is processed and stored directly or, if --max-status-report-workers is > 1, enqueued for asynchronous processing.
Skipped: For duplicated status reports.

Metric Name	Labels	PromQL Expression
`config_handling_duration_count`	`component`	sum(increase(config_handling_duration_count{component="tsb"}[5m])) by(skip)
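
Since the expression groups by the `skip` label, the panel yields one series per outcome. A minimal sketch of how the share of skipped (duplicated) reports could be derived from those series is shown below. It assumes the `skip` label takes the values `"true"` and `"false"`, which is not confirmed by this page, and the function name is hypothetical.

```python
from typing import Mapping


def skipped_share(reports_by_skip: Mapping[str, float]) -> float:
    """Fraction of received status reports that were skipped."""
    total = sum(reports_by_skip.values())
    if total == 0:
        return 0.0
    # Assumption: skipped reports carry skip="true".
    return reports_by_skip.get("true", 0.0) / total


# Hypothetical 5m counts keyed by the value of the skip label.
print(skipped_share({"true": 25.0, "false": 75.0}))
```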

Config status cache operations every 5m

Number of operations done in the config status cache when receiving new config statuses.

Metric Name	Labels	PromQL Expression
`config_status_cache_add_total`	`error`	sum(increase(config_status_cache_add_total{error="false"}[5m]))
`config_status_cache_add_total`	`error`	sum(increase(config_status_cache_add_total{error="true"}[5m])) or on() vector(0)
`config_status_cache_check_total`	N/A	sum(increase(config_status_cache_check_total[5m]))
`config_status_cache_check_total`	`error`	sum(increase(config_status_cache_check_total{error="true"}[5m])) or on() vector(0)
`config_status_cache_invalidate_total`	N/A	sum(increase(config_status_cache_invalidate_total[5m]))
`grpc_server_msg_received_total`	`component` `grpc_method`	sum(increase(grpc_server_msg_received_total{grpc_method="PushStatus", component="tsb"}[5m]))

TSB Processed Status Reports in 5m

The number of config status reports processed by TSB. This number must be the same as the number of handled ones without the skipped ones.

Metric Name	Labels	PromQL Expression
`config_status_report_work_duration_bucket`	`component`	histogram_quantile(0.99, sum(rate(config_status_report_work_duration_bucket{component="tsb"}[5m])) by (le, skip))

TSB Handling Status Reports Duration 5m

The P99 duration, in milliseconds, of handling a status report received by TSB. Each received config status report is either handled or skipped.

Handled: The status report is processed and stored directly or, if --max-status-report-workers is > 1, enqueued for asynchronous processing.
Skipped: For duplicated status reports.

Metric Name	Labels	PromQL Expression
`config_handling_duration_bucket`	`component`	histogram_quantile(0.99, sum(rate(config_handling_duration_bucket{component="tsb"}[5m])) by (le, skip))

Config status cache operations by event type every 5m

Number of operations done in the cache by event type.

This metric helps understand how much event processing can be skipped on the TSB side because TSB already knows about the received events, and how status event reporting relates to load on the TSB side.

Metric Name	Labels	PromQL Expression
`config_status_cache_add_total`	`component` `error`	sum(increase(config_status_cache_add_total{error="false", component="tsb"}[5m])) by (type)
`config_status_cache_check_total`	`component` `error`	sum(increase(config_status_cache_check_total{error="false", component="tsb"}[5m])) by (type)

TSB Processing Status Reports Duration 5m

The P99 duration, in milliseconds, of processing config status reports handled by TSB. Processing a config status report involves analysing it, applying and storing the result, and propagating it to parents and dependants.

Metric Name	Labels	PromQL Expression
`config_status_report_work_duration_bucket`	`component`	histogram_quantile(0.99, sum(rate(config_status_report_work_duration_bucket{component="tsb"}[5m])) by (le, skip))

Status Reports Work per Shard Distribution

Only applies when tsb --max-status-report-workers is > 1. Distribution of the status report work across the different shards.

Metric Name	Labels	PromQL Expression
`sharded_queue_work_duration_count`	`component` `name`	sum(increase(sharded_queue_work_duration_count{name="status-reports", component="tsb"}[5m])) by (component, shard)
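
One way to read this panel is to compare the busiest shard against the average. The sketch below computes that ratio from the per-shard values; a result close to 1.0 means the work is spread evenly. The function name and the shard identifiers are hypothetical.

```python
from typing import Mapping


def shard_imbalance(work_by_shard: Mapping[str, float]) -> float:
    """Ratio between the busiest shard and the mean work per shard."""
    counts = list(work_by_shard.values())
    if not counts or sum(counts) == 0:
        return 1.0
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical 5m work counts keyed by the shard label.
print(shard_imbalance({"0": 100.0, "1": 100.0}))
print(shard_imbalance({"0": 300.0, "1": 100.0, "2": 100.0, "3": 100.0}))
```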

Work executions every 5m

Only applies when tsb --max-status-report-workers is > 1. Number of status processing jobs processed.

Metric Name	Labels	PromQL Expression
`sharded_queue_work_duration_count`	`component`	sum(increase(sharded_queue_work_duration_count{component="tsb"}[5m])) by (name)

Status updates worker time every 5m

Only applies when tsb --max-status-report-workers is > 1. Time it takes for workers to process a single status update event.

Metric Name	Labels	PromQL Expression
`sharded_queue_work_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(sharded_queue_work_duration_bucket[5m])) by (le, name))

TSB Status Updates Enqueue Delay 5m

Only applies when tsb --max-status-report-workers is > 1. The P99 delay, in milliseconds, from the moment a config status report starts being enqueued until the queue accepts (enqueues) it.

If the P99 delay grows, a shard of the queue has filled up and reached its maximum capacity. If the delay exceeds several minutes, there is probably a deadlock.

Metric Name	Labels	PromQL Expression
`sharded_queue_enqueue_delay_bucket`	`name`	histogram_quantile(0.99, sum(rate(sharded_queue_enqueue_delay_bucket{name="status-reports"}[5m])) by (le, name))
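
The interpretation given for this panel can be expressed as a simple classification of the P99 value. This page does not state concrete thresholds, so the two limits in the sketch below are placeholder assumptions that would need tuning for a real deployment; the function name is hypothetical as well.

```python
def classify_enqueue_delay(p99_ms: float,
                           high_ms: float = 1_000.0,
                           deadlock_ms: float = 60_000.0) -> str:
    """Map a P99 enqueue delay in milliseconds to a coarse diagnosis."""
    if p99_ms >= deadlock_ms:
        # A delay in the range of minutes suggests the queue is stuck.
        return "probable deadlock"
    if p99_ms >= high_ms:
        # A high delay suggests a shard has reached its capacity.
        return "shard at capacity"
    return "ok"


print(classify_enqueue_delay(12.0))
print(classify_enqueue_delay(4_500.0))
print(classify_enqueue_delay(180_000.0))
```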

TSB Status Updates Enqueue Delay 5m

Only applies when tsb --max-status-report-workers is > 1. The delay, in milliseconds, from the moment a config status report starts being enqueued until the queue accepts (enqueues) it.

The config status report queue can hold a fixed number of elements per bucket. If the enqueuing latency goes up, there is contention in the queue, and more elements cannot be placed until the queue frees space by consuming its enqueued elements.

Metric Name	Labels	PromQL Expression
`sharded_queue_enqueue_delay_bucket`	`name`	sum(rate(sharded_queue_enqueue_delay_bucket{name="status-reports"}[5m])) by (le)

TSB Status Updates Worker Delay 5m

Only applies when tsb --max-status-report-workers is > 1. The delay, in milliseconds, between a config status report being received and being processed by the work queue.

Metric Name	Labels	PromQL Expression
`sharded_queue_work_delay_bucket`	`name`	histogram_quantile(0.99, sum(rate(sharded_queue_work_delay_bucket{name="status-reports"}[5m])) by (le, name))

TSB Status Updates Worker Delay 5m

Only applies when tsb --max-status-report-workers is > 1. The delay, in milliseconds, between a config status report being received and being processed by the work queue.

Metric Name	Labels	PromQL Expression
`sharded_queue_work_delay_bucket`	`name`	sum(rate(sharded_queue_work_delay_bucket{name="status-reports"}[5m])) by (le)

Cluster Status Update from XCP every 5m

Cluster status update messages received from XCP over a gRPC stream.

Metric Name	Labels	PromQL Expression
`grpc_client_msg_received_total`	`component` `grpc_method`	sum(increase(grpc_client_msg_received_total{component="mpc", grpc_method="GetClusterState" }[5m])) or on() vector(0)

Cluster updates from XCP processed every 5m

The number of cluster status updates received by the Management Plane Controller (MPC) from XCP that must be processed and sent to TSB.

XCP sends the cluster status updates (e.g. services deployed in the cluster) over a permanently connected gRPC stream to MPC. This metric shows the number of messages received and processed by MPC on that stream.

Metric Name	Labels	PromQL Expression
`permanent_stream_operation_total`	`error` `name`	sum(increase(permanent_stream_operation_total{name="ClusterStateFromXCP", error="" }[5m])) or on() vector(0)
`permanent_stream_operation_total`	`error` `name`	sum(increase(permanent_stream_operation_total{name="ClusterStateFromXCP", error!="" }[5m])) or on() vector(0)

XCP cluster status updates Sent to TSB every 5m

This is the number of cluster status updates that are processed by the Management Plane Controller (MPC) to be sent to TSB.

MPC sends the cluster status updates over a gRPC stream that is permanently connected to TSB, and this metric shows the number of cluster updates that are processed by MPC and sent to TSB on that stream.

Metric Name	Labels	PromQL Expression
`permanent_stream_operation_total`	`error` `name`	sum(increase(permanent_stream_operation_total{name="ClusterUpdates", error=""}[5m])) or on() vector(0)
`permanent_stream_operation_total`	`error` `name`	sum(increase(permanent_stream_operation_total{name="ClusterUpdates", error!=""}[5m])) or on() vector(0)

Cluster status updates to TSB stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster status updates stream. MPC sends the cluster status updates over a permanently connected gRPC stream to TSB. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name	Labels	PromQL Expression
`permanent_stream_connection_attempts_total`	`error` `name`	sum(increase(permanent_stream_connection_attempts_total{name="ClusterUpdates", error=""}[5m])) or on() vector(0)
`permanent_stream_connection_attempts_total`	`error` `name`	sum(increase(permanent_stream_connection_attempts_total{name="ClusterUpdates", error!=""}[5m])) or on() vector(0)

Cluster updates from XCP stream connection attempts every 5m

The number of connection (and reconnection) attempts on the cluster status updates from XCP stream. XCP sends the cluster status updates over a permanently connected gRPC stream to MPC. This metric shows the number of connections and reconnections that happened on that stream.

Metric Name	Labels	PromQL Expression
`permanent_stream_connection_attempts_total`	`error` `name`	sum(increase(permanent_stream_connection_attempts_total{name="ClusterStateFromXCP", error="" }[5m])) or on() vector(0)
`permanent_stream_connection_attempts_total`	`error` `name`	sum(increase(permanent_stream_connection_attempts_total{name="ClusterStateFromXCP", error!="" }[5m])) or on() vector(0)

GC Count by Component in Management Plane

Metric Name	Labels	PromQL Expression
`go_gc_duration_seconds_count`	`component` `plane`	sum(rate(go_gc_duration_seconds_count{component=~"tsb\|mpc\|xcp", plane="management"}[5m])) by (component)

GC Duration by Component in Management Plane

Metric Name	Labels	PromQL Expression
`go_gc_duration_seconds_count`	`component` `plane`	sum(rate(go_gc_duration_seconds_sum{component=~"tsb|mpc|xcp", plane="management"}[5m]) / rate(go_gc_duration_seconds_count{component=~"tsb|mpc|xcp", plane="management"}[5m])) by (component)
`go_gc_duration_seconds_sum`	`component` `plane`	sum(rate(go_gc_duration_seconds_sum{component=~"tsb|mpc|xcp", plane="management"}[5m]) / rate(go_gc_duration_seconds_count{component=~"tsb|mpc|xcp", plane="management"}[5m])) by (component)
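
This panel, like the other "average" panels on this page, divides the rate of a `_sum` series by the rate of the matching `_count` series to obtain the mean duration per event over the window. A minimal sketch of that arithmetic follows; the function name and the input values are hypothetical.

```python
from typing import Optional


def average_duration(sum_rate: float, count_rate: float) -> Optional[float]:
    """Mean duration per event: rate(<metric>_sum) / rate(<metric>_count)."""
    if count_rate == 0:
        # No events in the window; PromQL would yield NaN here.
        return None
    return sum_rate / count_rate


# Hypothetical values: 0.004 s of GC time per second across 2 GC cycles
# per second gives the mean duration of one cycle.
print(average_duration(sum_rate=0.004, count_rate=2.0))
```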

Heap Allocations by Component in Management Plane

Metric Name	Labels	PromQL Expression
`go_memstats_heap_alloc_bytes`	`component` `plane`	sum(max_over_time(go_memstats_heap_alloc_bytes{component=~"tsb\|mpc\|xcp", plane="management"}[5m])) by (component)

Heap Objects by Component in Management Plane

Metric Name	Labels	PromQL Expression
`go_memstats_heap_objects`	`component` `plane`	sum(max_over_time(go_memstats_heap_objects{component=~"tsb\|mpc\|xcp", plane="management"}[5m])) by (component)

Next GC Target by Component in Management Plane

The target heap size for the next GC cycle. The garbage collector aims to keep the allocated heap no larger than this value.

Metric Name	Labels	PromQL Expression
`go_memstats_next_gc_bytes`	`component` `plane`	sum(max_over_time(go_memstats_next_gc_bytes{component=~"tsb\|mpc\|xcp", plane="management"}[5m])) by (component)

Heap Utilization Percentage by Component in Management Plane

Metric Name	Labels	PromQL Expression
`go_memstats_heap_idle_bytes`	`component`	( sum(go_memstats_heap_inuse_bytes{component="xcp"}) by (component) / sum(go_memstats_heap_idle_bytes{component="xcp"} + go_memstats_heap_inuse_bytes{component="xcp"}) by (component) ) * 100
`go_memstats_heap_idle_bytes`	`component`	( sum(go_memstats_heap_inuse_bytes{component="tsb"}) by (component) / sum(go_memstats_heap_idle_bytes{component="tsb"} + go_memstats_heap_inuse_bytes{component="tsb"}) by (component) ) * 100
`go_memstats_heap_idle_bytes`	`component`	( sum(go_memstats_heap_inuse_bytes{component="mpc"}) by (component) / sum(go_memstats_heap_idle_bytes{component="mpc"} + go_memstats_heap_inuse_bytes{component="mpc"}) by (component) ) * 100
`go_memstats_heap_inuse_bytes`	`component`	( sum(go_memstats_heap_inuse_bytes{component="xcp"}) by (component) / sum(go_memstats_heap_idle_bytes{component="xcp"} + go_memstats_heap_inuse_bytes{component="xcp"}) by (component) ) * 100
`go_memstats_heap_inuse_bytes`	`component`	( sum(go_memstats_heap_inuse_bytes{component="tsb"}) by (component) / sum(go_memstats_heap_idle_bytes{component="tsb"} + go_memstats_heap_inuse_bytes{component="tsb"}) by (component) ) * 100
`go_memstats_heap_inuse_bytes`	`component`	( sum(go_memstats_heap_inuse_bytes{component="mpc"}) by (component) / sum(go_memstats_heap_idle_bytes{component="mpc"} + go_memstats_heap_inuse_bytes{component="mpc"}) by (component) ) * 100
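
All six rows above compute the same ratio per component: heap bytes in use divided by the sum of idle and in-use heap bytes, expressed as a percentage. The sketch below reproduces that arithmetic; the function name and the byte values are hypothetical.

```python
def heap_utilization_percent(inuse_bytes: float, idle_bytes: float) -> float:
    """Percentage of the heap spans that are currently in use."""
    total = inuse_bytes + idle_bytes
    if total == 0:
        return 0.0
    return inuse_bytes / total * 100


# Hypothetical readings of go_memstats_heap_inuse_bytes and
# go_memstats_heap_idle_bytes for one component.
print(heap_utilization_percent(inuse_bytes=75.0, idle_bytes=25.0))
```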

GC CPU Fraction by Component in Management Plane

Metric Name	Labels	PromQL Expression
`go_gc_duration_seconds_sum`	`component` `plane`	sum(rate(go_gc_duration_seconds_sum{component=~"tsb|mpc|xcp", plane="management"}[5m]) / rate(process_cpu_seconds_total{component=~"tsb|mpc|xcp", plane="management"}[5m])) by (component) * 100
`process_cpu_seconds_total`	`component` `plane`	sum(rate(go_gc_duration_seconds_sum{component=~"tsb|mpc|xcp", plane="management"}[5m]) / rate(process_cpu_seconds_total{component=~"tsb|mpc|xcp", plane="management"}[5m])) by (component) * 100

Goroutines by Component in Management Plane

Metric Name	Labels	PromQL Expression
`go_goroutines`	`component` `plane`	sum(max_over_time(go_goroutines{component=~"tsb\|mpc\|xcp", plane="management"}[5m])) by (component)

Heap Sys by Component in Management Plane

Metric Name	Labels	PromQL Expression
`go_memstats_heap_sys_bytes`	`component` `plane`	sum(max_over_time(go_memstats_heap_sys_bytes{component=~"tsb\|mpc\|xcp", plane="management"}[5m])) by (component)

gRPC Server Calls Started Rate

The rate of RPCs started on the server.

Metric Name	Labels	PromQL Expression
`grpc_server_started_total`	`component` `grpc_method`	sum(rate(grpc_server_started_total{component="tsb", grpc_method=~"GetAllClusters\|UpdateClusterState\|GetAllConfigObjects"}[5m])) by (grpc_method, component)

gRPC Server Handled Rate

The rate of RPCs completed on the server.

Metric Name	Labels	PromQL Expression
`grpc_server_handled_total`	`component` `grpc_method`	sum(rate(grpc_server_handled_total{component="tsb", grpc_method=~"GetAllClusters\|UpdateClusterState\|GetAllConfigObjects"}[5m])) by (grpc_method, component)

gRPC Client Calls Started Rate

The rate of the RPCs started on the client.

Metric Name	Labels	PromQL Expression
`grpc_client_started_total`	`component`	sum(rate(grpc_client_started_total{component="mpc"}[5m])) by (grpc_method, component)

gRPC Client Handled Rate

The rate of RPCs completed on the client.

Metric Name	Labels	PromQL Expression
`grpc_client_handled_total`	`component`	sum(rate(grpc_client_handled_total{component="mpc"}[5m])) by (grpc_method, component)

gRPC Server Handled Status Rate

Metric Name	Labels	PromQL Expression
`grpc_server_handled_total`	`component` `grpc_method`	sum(rate(grpc_server_handled_total{component="tsb", grpc_method=~"GetAllClusters\|UpdateClusterState\|GetAllConfigObjects"}[5m])) by (component, grpc_code)

gRPC Client Handled Status Rate

Metric Name	Labels	PromQL Expression
`grpc_client_handled_total`	`component`	max(rate(grpc_client_handled_total{component="mpc"}[5m])) by (grpc_code, component)

gRPC Server Msg Sent Rate

Metric Name	Labels	PromQL Expression
`grpc_server_msg_sent_total`	`component` `grpc_method`	sum(rate(grpc_server_msg_sent_total{component="tsb", grpc_method=~"GetAllClusters\|UpdateClusterState\|GetAllConfigObjects"}[5m])) by (grpc_method, component)

gRPC Client Msg Received Rate

Metric Name	Labels	PromQL Expression
`grpc_client_msg_received_total`	`component`	sum(rate(grpc_client_msg_received_total{component="mpc"}[5m])) by (grpc_method, component)

gRPC Client Msg Sent Rate

Metric Name	Labels	PromQL Expression
`grpc_server_msg_sent_total`	`component` `grpc_method`	sum(rate(grpc_server_msg_sent_total{component="tsb", grpc_method=~"GetAllClusters\|UpdateClusterState\|GetAllConfigObjects"}[5m])) by (grpc_method, component)

gRPC Server Msg Received Rate

Metric Name	Labels	PromQL Expression
`grpc_server_msg_received_total`	`component` `grpc_method`	max(rate(grpc_server_msg_received_total{component="tsb", grpc_method=~"GetAllClusters\|UpdateClusterState\|GetAllConfigObjects\|PushStatus"}[5m])) by (grpc_method, component)

OAP Operational Status

Operational metrics to indicate Tetrate Service Bridge OAP stack health.

OAP Request Rate

The request rate to OAP, by status.

Metric Name	Labels	PromQL Expression
`envoy_cluster_upstream_rq_xx_total`	`envoy_cluster_name` `plane`	sum by (envoy_response_code_class) (rate(envoy_cluster_upstream_rq_xx_total{envoy_cluster_name="oap-grpc", plane="management"}[1m]))

OAP Request Latency

The OAP request latency.

Metric Name	Labels	PromQL Expression
`envoy_cluster_upstream_rq_time_bucket`	`envoy_cluster_name` `plane`	histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
`envoy_cluster_upstream_rq_time_bucket`	`envoy_cluster_name` `plane`	histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
`envoy_cluster_upstream_rq_time_bucket`	`envoy_cluster_name` `plane`	histogram_quantile(0.90, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
`envoy_cluster_upstream_rq_time_bucket`	`envoy_cluster_name` `plane`	histogram_quantile(0.75, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))
`envoy_cluster_upstream_rq_time_bucket`	`envoy_cluster_name` `plane`	histogram_quantile(0.50, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="oap-grpc", plane="management"}[1m])) by (le))

OAP Aggregation Request Rate

OAP Aggregation Request Rate, by type:

central aggregation service handler received
central application aggregation received
central service aggregation received

Metric Name	Labels	PromQL Expression
`central_aggregation_handler_total`	N/A	sum(rate(central_aggregation_handler_total[1m]))
`central_app_aggregation_total`	N/A	sum(rate(central_app_aggregation_total[1m]))
`central_service_aggregation_total`	N/A	sum(rate(central_service_aggregation_total[1m]))

OAP Aggregation Rows

Cumulative rate of rows in OAP aggregation.

Metric Name	Labels	PromQL Expression
`metrics_aggregation_total`	`plane`	sum(rate(metrics_aggregation_total{plane="management"}[1m]))

OAP Mesh Analysis Latency

The processing latency of the OAP service mesh telemetry streaming process.

Metric Name	Labels	PromQL Expression
`mesh_analysis_latency_bucket`	`component` `plane`	histogram_quantile(0.99, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
`mesh_analysis_latency_bucket`	`component` `plane`	histogram_quantile(0.95, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
`mesh_analysis_latency_bucket`	`component` `plane`	histogram_quantile(0.90, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))
`mesh_analysis_latency_bucket`	`component` `plane`	histogram_quantile(0.75, sum(rate(mesh_analysis_latency_bucket{plane="control", component="oap"}[1m])) by (le))

OAP Zipkin Trace Rate

The rate at which OAP processes Zipkin traces.

Metric Name	Labels	PromQL Expression
`trace_in_latency_count`	`plane` `protocol`	sum(rate(trace_in_latency_count{protocol='zipkin-http',plane='control'}[1m]))

OAP Zipkin Trace Latency

The OAP trace processing latency.

Metric Name	Labels	PromQL Expression
`trace_in_latency_bucket`	N/A	histogram_quantile(0.99, sum(rate(trace_in_latency_bucket[5m])) by (le))
`trace_in_latency_bucket`	N/A	histogram_quantile(0.95, sum(rate(trace_in_latency_bucket[5m])) by (le))
`trace_in_latency_bucket`	N/A	histogram_quantile(0.90, sum(rate(trace_in_latency_bucket[5m])) by (le))
`trace_in_latency_bucket`	N/A	histogram_quantile(0.75, sum(rate(trace_in_latency_bucket[5m])) by (le))
`trace_in_latency_bucket`	N/A	histogram_quantile(0.50, sum(rate(trace_in_latency_bucket[5m])) by (le))

OAP Zipkin Trace Error Rate

The error rate of OAP Zipkin trace processing.

Metric Name	Labels	PromQL Expression
`trace_analysys_error_count`	`plane` `protocol`	sum(rate(trace_analysys_error_count{protocol='zipkin-http',plane='control'}[1m]))

JVM Threads

Number of threads in the OAP JVM.

Metric Name	Labels	PromQL Expression
`jvm_threads_current`	`component` `plane`	sum(jvm_threads_current{component="oap", plane="management"})
`jvm_threads_daemon`	`component` `plane`	sum(jvm_threads_daemon{component="oap", plane="management"})
`jvm_threads_deadlocked`	`component` `plane`	sum(jvm_threads_deadlocked{component="oap", plane="management"})
`jvm_threads_peak`	`component` `plane`	sum(jvm_threads_peak{component="oap", plane="management"})

JVM Memory

JVM Memory stats of OAP JVM instances.

Metric Name	Labels	PromQL Expression
`jvm_memory_bytes_max`	`component` `plane`	sum by (area, instance) (jvm_memory_bytes_max{component="oap", plane="management"})
`jvm_memory_bytes_used`	`component` `plane`	sum by (area, instance) (jvm_memory_bytes_used{component="oap", plane="management"})

Segmentation Operational Status

ACL Computation Duration

Time taken to compute ACLs, showing average and p98 latency by reason. Includes both full computation and diff computation durations.

Metric Name	Labels	PromQL Expression
`acl_computation_diff_duration_milliseconds_bucket`	`component`	histogram_quantile(0.98, sum(rate(acl_computation_diff_duration_milliseconds_bucket{component="n2ac"}[1m])) by (reason, le))
`acl_computation_diff_duration_milliseconds_count`	`component`	sum(rate(acl_computation_diff_duration_milliseconds_sum{component="n2ac"}[1m]) / rate(acl_computation_diff_duration_milliseconds_count{component="n2ac"}[1m])) by (reason)
`acl_computation_diff_duration_milliseconds_sum`	`component`	sum(rate(acl_computation_diff_duration_milliseconds_sum{component="n2ac"}[1m]) / rate(acl_computation_diff_duration_milliseconds_count{component="n2ac"}[1m])) by (reason)
`acl_computation_duration_milliseconds_bucket`	`component`	histogram_quantile(0.98, sum(rate(acl_computation_duration_milliseconds_bucket{component="n2ac"}[1m])) by (reason, le))
`acl_computation_duration_milliseconds_count`	`component`	sum(rate(acl_computation_duration_milliseconds_sum{component="n2ac"}[1m]) / rate(acl_computation_duration_milliseconds_count{component="n2ac"}[1m])) by (reason)
`acl_computation_duration_milliseconds_sum`	`component`	sum(rate(acl_computation_duration_milliseconds_sum{component="n2ac"}[1m]) / rate(acl_computation_duration_milliseconds_count{component="n2ac"}[1m])) by (reason)

ACL Computation Rate

Rate of ACL computations per second. Indicates how frequently access control lists are being recalculated.

Metric Name	Labels	PromQL Expression
`acl_computation_count_total`	`component`	sum(rate(acl_computation_count_total{component="n2ac"}[1m]))

ACL Generation Rate

Rate of individual ACL entries being generated per second, broken down by reason.

Metric Name	Labels	PromQL Expression
`acl_computation_access_count_total`	`component`	sum(rate(acl_computation_access_count_total{component="n2ac"}[1m])) by (reason)

ACL Client Msg Receive Rate

Rate of ACL messages received per second, measured on the client side. Unlike the server-side metrics, these values are pure gRPC streaming data and do not indicate the number of ACLs.

Metric Name	Labels	PromQL Expression
`grpc_client_msg_received_total`	`component` `grpc_method`	sum(rate(grpc_client_msg_received_total{component="tsb", grpc_method="Describe2"}[1m]))

ACL Client Start/Handled

Rate of ACL client start vs handled. The ACL client is a streaming client so high rates could indicate an issue with connectivity to the segmentation process.

Metric Name	Labels	PromQL Expression
`grpc_client_handled_total`	`component` `grpc_method`	sum(rate(grpc_client_handled_total{component="tsb", grpc_method="Describe2"}[1m])) by (grpc_code)
`grpc_client_started_total`	`component` `grpc_method`	sum(rate(grpc_client_started_total{component="tsb", grpc_method="Describe2"}[1m]))

State Update Duration by Operation

Average and p98 latency of state update server calls, grouped by operation. Helps identify slow state updates.

Metric Name	Labels	PromQL Expression
`rpc_server_duration_milliseconds_bucket`	`component` `rpc_method`	histogram_quantile(0.98, sum(rate(rpc_server_duration_milliseconds_bucket{component="n2ac", rpc_method="ExecOp"}[1m])) by (rpc_service, rpc_method, operation_hint, le))
`rpc_server_duration_milliseconds_count`	`component` `rpc_method`	sum(rate(rpc_server_duration_milliseconds_sum{component="n2ac"}[1m]) / rate(rpc_server_duration_milliseconds_count{component="n2ac", rpc_method="ExecOp"}[1m])) by (rpc_service, rpc_method, operation_hint)
`rpc_server_duration_milliseconds_sum`	`component`	sum(rate(rpc_server_duration_milliseconds_sum{component="n2ac"}[1m]) / rate(rpc_server_duration_milliseconds_count{component="n2ac", rpc_method="ExecOp"}[1m])) by (rpc_service, rpc_method, operation_hint)

State Update Per Second by Operation

Rate of incoming state update requests per second, grouped by operation. Indicates traffic volume for each operation type.

Metric Name	Labels	PromQL Expression
`rpc_server_requests_per_rpc_count`	`component` `rpc_method`	sum(rate(rpc_server_requests_per_rpc_count{component="n2ac", rpc_method="ExecOp"}[1m])) by (rpc_method, rpc_service, operation_hint)

State Update Bytes Received by Operation

Average size of incoming state update request payloads per operation. Helps identify operations with large request sizes.

Metric Name	Labels	PromQL Expression
`rpc_server_request_size_bytes_count`	`component` `rpc_method`	sum(rate(rpc_server_request_size_bytes_sum{component="n2ac", rpc_method="ExecOp"}[1m]) / rate(rpc_server_request_size_bytes_count{component="n2ac", rpc_method="ExecOp"}[1m])) by (rpc_service, rpc_method, operation_hint)
`rpc_server_request_size_bytes_sum`	`component` `rpc_method`	sum(rate(rpc_server_request_size_bytes_sum{component="n2ac", rpc_method="ExecOp"}[1m]) / rate(rpc_server_request_size_bytes_count{component="n2ac", rpc_method="ExecOp"}[1m])) by (rpc_service, rpc_method, operation_hint)

DB Client Operation Duration by Command

Average and p98 latency of database operations by command type. Helps identify slow queries.

Metric Name	Labels	PromQL Expression
`db_client_operation_duration_seconds_bucket`	`component`	histogram_quantile(0.98, sum(rate(db_client_operation_duration_seconds_bucket{component="n2ac"}[1m])) by (operation_type, query_command, le))
`db_client_operation_duration_seconds_count`	`component`	sum(rate(db_client_operation_duration_seconds_sum{component="n2ac"}[1m]) / rate(db_client_operation_duration_seconds_count{component="n2ac"}[1m])) by (operation_type, query_command)
`db_client_operation_duration_seconds_sum`	`component`	sum(rate(db_client_operation_duration_seconds_sum{component="n2ac"}[1m]) / rate(db_client_operation_duration_seconds_count{component="n2ac"}[1m])) by (operation_type, query_command)
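The average-latency expressions in this panel divide the per-second rate of the `_sum` series by the per-second rate of the `_count` series. The following is a minimal Python sketch of that arithmetic, assuming two hypothetical scrapes of each counter taken 60 seconds apart. The sample values are invented for illustration, and the sketch ignores counter resets and the extrapolation that Prometheus applies in `rate()`.

```python
def rate(first: float, last: float, window_seconds: float) -> float:
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last - first) / window_seconds


def average_duration(sum_first, sum_last, count_first, count_last, window_seconds=60.0):
    """Mean duration per operation over the window: rate(sum) / rate(count)."""
    count_rate = rate(count_first, count_last, window_seconds)
    if count_rate == 0:
        # No operations in the window; PromQL would yield NaN for 0/0.
        return float("nan")
    return rate(sum_first, sum_last, window_seconds) / count_rate


# Hypothetical samples: 30 operations took 1.5 seconds in total during the window.
avg = average_duration(sum_first=10.0, sum_last=11.5, count_first=200, count_last=230)
print(f"average duration: {avg:.3f} s")
```

Because both rates share the same window, the window length cancels out and the result is simply the added duration divided by the added operation count.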

DB Client Operation Rate by Command

Rate of database operations per second by command type, including error rates.

Metric Name	Labels	PromQL Expression
`db_client_operation_count_total`	`component`	sum(rate(db_client_operation_count_total{component="n2ac"}[1m])) by (operation_type, query_command)
`db_client_operation_error_count_total`	`component`	sum(rate(db_client_operation_error_count_total{component="n2ac"}[1m])) by (operation_type, query_command)

Connection Usage

Number of active connections, either idle or in use.

Metric Name	Labels	PromQL Expression
`db_client_connections_usage`	`component` `db_system_name` `state`	sum(max_over_time(db_client_connections_usage{db_system_name="postgresql", component=~"n2ac", state="idle"}[1m])) by(component)
`db_client_connections_usage`	`component` `db_system_name` `state`	sum(max_over_time(db_client_connections_usage{db_system_name="postgresql", component=~"n2ac", state="used"}[1m])) by(component)

Connection Open/Max

Max number of connections allowed vs. currently used connections.

Metric Name	Labels	PromQL Expression
`db_client_connection_max`	`component` `db_system_name`	sum(max_over_time(db_client_connection_max{db_system_name="postgresql", component="n2ac"}[1m]))
`db_client_connections_total`	`component` `db_system_name`	sum(max_over_time(db_client_connections_total{db_system_name="postgresql", component="n2ac"}[1m]))

Connections Waited

Time taken to successfully acquire connections from the pool, either waiting for release or constructing new.

Metric Name	Labels	PromQL Expression
`pgx_pool_wait_for_acquire_total`	`component` `db_system_name`	sum(increase(pgx_pool_wait_for_acquire_total{db_system_name="postgresql", component="n2ac"}[1m]))

Connections Created

Number of new connections created by the pool.

Metric Name	Labels	PromQL Expression
`pgx_pool_connections_created_total`	`component` `db_system_name`	sum(increase(pgx_pool_connections_created_total{db_system_name="postgresql", component="n2ac"}[1m]))

Connections Destroyed

Number of connections destroyed by the pool, either due to being idle for too long or exceeding lifetime.

Metric Name	Labels	PromQL Expression
`pgx_pool_connections_destroyed_total`	`component` `db_system_name` `reason`	sum(increase(pgx_pool_connections_destroyed_total{db_system_name="postgresql", reason="idletime", component="n2ac"}[1m]))
`pgx_pool_connections_destroyed_total`	`component` `db_system_name` `reason`	sum(increase(pgx_pool_connections_destroyed_total{db_system_name="postgresql", reason="lifetime", component="n2ac"}[1m]))

Heap Utilization Percentage

Percentage of heap memory in use vs total heap. High values may indicate memory pressure.

Metric Name	Labels	PromQL Expression
`go_memstats_heap_idle_bytes`	`component`	(sum(go_memstats_heap_inuse_bytes{component="n2ac"}) by (component) / sum(go_memstats_heap_idle_bytes{component="n2ac"} + go_memstats_heap_inuse_bytes{component=~"n2ac"}) by (component)) * 100
`go_memstats_heap_inuse_bytes`	`component`	(sum(go_memstats_heap_inuse_bytes{component="n2ac"}) by (component) / sum(go_memstats_heap_idle_bytes{component="n2ac"} + go_memstats_heap_inuse_bytes{component=~"n2ac"}) by (component)) * 100
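The heap utilization expression reduces to in-use bytes divided by the sum of idle and in-use bytes. A small Python sketch of that calculation follows; the function name and the sample byte values are assumptions made for illustration.

```python
def heap_utilization_percent(heap_inuse_bytes: float, heap_idle_bytes: float) -> float:
    """Share of the heap that is in use: inuse / (idle + inuse) * 100."""
    total = heap_idle_bytes + heap_inuse_bytes
    if total == 0:
        return 0.0  # no heap reported yet; avoid dividing by zero
    return heap_inuse_bytes / total * 100


# Hypothetical readings: 300 MiB in use, 100 MiB idle.
print(heap_utilization_percent(300 * 1024**2, 100 * 1024**2))
```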

Heap Allocations

Bytes of allocated heap objects.

Metric Name	Labels	PromQL Expression
`go_memstats_heap_alloc_bytes`	`component`	sum(max_over_time(go_memstats_heap_alloc_bytes{component=~"n2ac"}[1m])) by (component)

Heap System

Total bytes of heap memory obtained from the OS. Shows the heap's virtual address space size.

Metric Name	Labels	PromQL Expression
`go_memstats_heap_sys_bytes`	`component`	sum(max_over_time(go_memstats_heap_sys_bytes{component=~"n2ac"}[1m])) by (component)

Heap Objects

Number of allocated heap objects.

Metric Name	Labels	PromQL Expression
`go_memstats_heap_objects`	`component`	sum(max_over_time(go_memstats_heap_objects{component=~"n2ac"}[1m])) by (component)

Stack Memory Used

Memory used by goroutine stacks. Increases with goroutine count and stack depth.

Metric Name	Labels	PromQL Expression
`go_memory_used_bytes`	`go_memory_type` `job`	avg by(job) (go_memory_used_bytes{job=~"n2ac",go_memory_type="stack"})

GC Duration

Average time spent per garbage collection cycle. High values may indicate GC pressure.

Metric Name	Labels	PromQL Expression
`go_gc_duration_seconds_count`	`component` `plane`	sum(rate(go_gc_duration_seconds_sum{component="n2ac", plane="management"}[1m]) / rate(go_gc_duration_seconds_count{component="n2ac", plane="management"}[1m])) by (component)
`go_gc_duration_seconds_sum`	`component` `plane`	sum(rate(go_gc_duration_seconds_sum{component="n2ac", plane="management"}[1m]) / rate(go_gc_duration_seconds_count{component="n2ac", plane="management"}[1m])) by (component)

GC Rate

Number of garbage collection cycles per second.

Metric Name	Labels	PromQL Expression
`go_gc_duration_seconds_count`	`component` `plane`	sum(rate(go_gc_duration_seconds_count{component=~"n2ac", plane="management"}[1m])) by (component)

GC Target

Heap size target that will trigger the next garbage collection cycle.

Metric Name	Labels	PromQL Expression
`go_memstats_next_gc_bytes`	`component` `plane`	sum(max_over_time(go_memstats_next_gc_bytes{component=~"n2ac", plane="management"}[1m])) by (component)

GC CPU Fraction

Percentage of CPU time spent in garbage collection.

Metric Name	Labels	PromQL Expression
`go_gc_duration_seconds_sum`	`component` `plane`	sum(rate(go_gc_duration_seconds_sum{component="n2ac", plane="management"}[1m]) / rate(process_cpu_seconds_total{component="n2ac", plane="management"}[1m])) by (component) * 100
`process_cpu_seconds_total`	`component` `plane`	sum(rate(go_gc_duration_seconds_sum{component="n2ac", plane="management"}[1m]) / rate(process_cpu_seconds_total{component="n2ac", plane="management"}[1m])) by (component) * 100
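The GC CPU fraction is the ratio between the increase in GC seconds and the increase in process CPU seconds over the same window, expressed as a percentage. The Python sketch below shows that ratio for two hypothetical scrapes of each counter; the function name and sample values are assumptions.

```python
def gc_cpu_percent(gc_first: float, gc_last: float, cpu_first: float, cpu_last: float) -> float:
    """Percentage of consumed CPU seconds attributed to GC between two scrapes."""
    gc_delta = gc_last - gc_first
    cpu_delta = cpu_last - cpu_first
    if cpu_delta == 0:
        return float("nan")  # no CPU time consumed in the window
    return gc_delta / cpu_delta * 100


# Hypothetical counters: 0.3 s of GC while the process consumed 6 s of CPU.
print(f"{gc_cpu_percent(1.0, 1.3, 100.0, 106.0):.1f}%")
```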

Processor Limit

The number of OS threads that can execute user-level Go code simultaneously.

Metric Name	Labels	PromQL Expression
`go_processor_limit`	`component`	avg by(component) (go_processor_limit{component=~"n2ac"})

Goroutine Count

Count of live goroutines.

Metric Name	Labels	PromQL Expression
`go_goroutine_count`	`component` `plane`	avg by(component) (go_goroutine_count{component=~"n2ac", plane="management"})

TSB Health

Fast diagnosis of TSB health.

MPC Health

MPC Health Status. The following metrics define the health of this component:

1. If mpc_info stops reporting, then it is KO.
2. If mpc has create operation errors, then it is degraded.
3. If mpc has fetch operation errors, then it is degraded.
4. If mpc gRPC streams had more than X errors, then it is degraded.

If 2 and 3, then it is KO.

Metric Name	Labels	PromQL Expression
`mpc_info`	N/A	absent(mpc_info) OR on() vector(0)
`mpc_xcp_config_create_ops_err`	N/A	rate(mpc_xcp_config_create_ops_err[20m]) OR on() vector(1) > 0
`mpc_xcp_config_fetch_ops_err`	N/A	rate(mpc_xcp_config_fetch_ops_err[20m]) OR on() vector(1) > 0
`permanent_stream_operation_total`	`component` `error`	(sum(increase(permanent_stream_operation_total{error!="", component="mpc"}[5m])) or on() vector(0)) > bool 5
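The MPC health rules can be combined into a single status as in the Python sketch below. The function name and the status strings are assumptions chosen for illustration. The default stream error limit of 5 is taken from the `> bool 5` comparison in the last expression.

```python
def mpc_health(info_reporting: bool, create_errors: bool, fetch_errors: bool,
               stream_errors: int, stream_error_limit: int = 5) -> str:
    """Combine the MPC health signals into OK, DEGRADED, or KO."""
    if not info_reporting:
        return "KO"  # mpc_info stopped reporting
    if create_errors and fetch_errors:
        return "KO"  # create and fetch errors at the same time
    if create_errors or fetch_errors or stream_errors > stream_error_limit:
        return "DEGRADED"
    return "OK"


print(mpc_health(info_reporting=True, create_errors=False, fetch_errors=True, stream_errors=0))
```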

XCP Central Health

XCP Central Health Status.

If the number of gRPC connections from Central to the edges and MPC is less than 1, then it is KO. Edge connections of type cluster_state should be 1 for the TSB API and 1 for each cluster. Because some scenarios in previous versions produced negative values, the check uses "less than 1" to account for them.
If the propagation time is greater than 10 seconds, it is degraded. If it is equal to or greater than 20 seconds, then it is KO.

Metric Name	Labels	PromQL Expression
`xcp_central_config_propagation_time_ms_bucket`	N/A	max(histogram_quantile(0.95, sum(rate(xcp_central_config_propagation_time_ms_bucket[60m])) by (le, edge))) OR on() vector(0)
`xcp_central_current_edge_connections`	`component` `connection_type`	(sum(xcp_central_current_edge_connections{connection_type="cluster_state", component="xcp"}) OR on() vector(0)) < bool 1
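The XCP Central thresholds can be sketched in Python as follows. The function name and status strings are assumptions, and the propagation time is taken in milliseconds to match the `_ms` metric name.

```python
def xcp_central_health(cluster_state_connections: int, propagation_p95_ms: float) -> str:
    """Apply the connection and propagation-time thresholds described for XCP Central."""
    if cluster_state_connections < 1:
        return "KO"  # no cluster_state connections (also covers negative values)
    propagation_seconds = propagation_p95_ms / 1000
    if propagation_seconds >= 20:
        return "KO"
    if propagation_seconds > 10:
        return "DEGRADED"
    return "OK"


print(xcp_central_health(cluster_state_connections=3, propagation_p95_ms=12_500))
```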

XCP Edge Health

This is a key metric about messages received by central from edges. If some of the edges stop reporting, there's a problem with the edges.
This is a key metric about time passed since edges synced with central. If it is more than 10 minutes, there's a problem with one of the edges.

Metric Name	Labels	PromQL Expression
`xcp_central_config_propagation_event_count_total`	`status` `type`	(min(increase(xcp_central_config_propagation_event_count_total{status="received",type="config_resync_request"}[8m])) OR on() vector(0)) == bool 0
`xcp_central_current_onboarded_edge`	N/A	max (time() - max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge[2m]) == 0) + on(edge,instance) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", type="cluster_state"} /1000) by (edge,type,instance) > bool 700)
`xcp_central_current_onboarded_edge_total`	N/A	max (time() - max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge[2m]) == 0) + on(edge,instance) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", type="cluster_state"} /1000) by (edge,type,instance) > bool 700)
`xcp_central_last_config_propagation_event_timestamp_ms`	`edge` `type`	max (time() - max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge[2m]) == 0) + on(edge,instance) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", type="cluster_state"} /1000) by (edge,type,instance) > bool 700)
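The last expression compares the current time with the last propagation timestamp of each edge. The Python sketch below shows a simplified version of that staleness check. It assumes a hypothetical mapping from edge name to last event timestamp in milliseconds, reuses the 700-second threshold from the expression, and leaves out the onboarded-edge terms.

```python
import time
from typing import Dict, List, Optional

STALE_AFTER_SECONDS = 700  # threshold used by the expression above


def stale_edges(last_event_ms_by_edge: Dict[str, float],
                now: Optional[float] = None,
                stale_after: float = STALE_AFTER_SECONDS) -> List[str]:
    """Return the edges whose last sync is older than the threshold."""
    now = time.time() if now is None else now
    return sorted(
        edge
        for edge, timestamp_ms in last_event_ms_by_edge.items()
        if now - timestamp_ms / 1000 > stale_after  # timestamps are in milliseconds
    )


# Hypothetical edges: "edge-a" synced 100 s ago, "edge-b" synced 1000 s ago.
print(stale_edges({"edge-a": 9_900_000, "edge-b": 9_000_000}, now=10_000.0))
```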

TSB API Health

TSB API Health Status.

If the OK gRPC codes reported by the TSB API are 0 or are not being reported, this silence indicates an error. If KO, use tctl or the UI to check whether the TSB API is responding. If everything is working, any call should set this metric back to OK.

Metric Name	Labels	PromQL Expression
`grpc_server_handled_total`	`component` `grpc_code` `grpc_type`	(sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary"}[5m])) by (grpc_code) OR on() vector(0)) == bool 0
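Several health expressions use the `OR on() vector(0)` pattern so that a missing series counts as zero, followed by `== bool 0` to turn the result into a 0 or 1 flag. A rough Python equivalent, assuming a hypothetical list of per-series OK rates, could look like this:

```python
from typing import List


def tsb_api_silent(ok_rates: List[float]) -> int:
    """Return 1 when no OK responses are observed, 0 otherwise."""
    # "OR on() vector(0)": treat an absent series as a rate of zero.
    total = sum(ok_rates) if ok_rates else 0.0
    # "== bool 0": produce 1 when the comparison holds, 0 when it does not.
    return 1 if total == 0 else 0


print(tsb_api_silent([]), tsb_api_silent([0.2, 0.1]))
```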

IAM Health

IAM Health Status.

If no authentication operations are reported, IAM is having an issue.
If the difference between the short and middle term latencies for JWT is more than 1 second, then IAM is degraded.
If the difference between the short and middle term latencies for JWT is more than 5 seconds, then IAM is having an issue.
If the difference between the short and middle term latencies for non-JWT is more than 5 seconds, then IAM is degraded.
If the difference between the short and middle term latencies for non-JWT is more than 30 seconds, then IAM is having an issue.

Metric Name	Labels	PromQL Expression
`iam_auth_time_bucket`	`error` `provider`	(abs((histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error="", provider="jwt.Provider"}[5m])) by (le)) OR on() vector(1)) - (histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error="", provider="jwt.Provider"}[30m])) by (le)) OR on() vector(1))) / 1000) > bool 1
`iam_auth_time_bucket`	`error` `provider`	(abs((histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error="", provider="jwt.Provider"}[5m])) by (le)) OR on() vector(1)) - (histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error="", provider="jwt.Provider"}[30m])) by (le)) OR on() vector(1))) / 1000) > bool 5
`iam_auth_time_bucket`	`error` `provider`	(abs((histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error="", provider!="jwt.Provider"}[5m])) by (le)) OR on() vector(1)) - (histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error="", provider!="jwt.Provider"}[30m])) by (le)) OR on() vector(1))) / 1000) > bool 5
`iam_auth_time_bucket`	`error` `provider`	(abs((histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error="", provider!="jwt.Provider"}[5m])) by (le)) OR on() vector(1)) - (histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error="", provider!="jwt.Provider"}[30m])) by (le)) OR on() vector(1))) / 1000) > bool 30
`iam_auth_time_count`	`error`	(max(sum(rate(iam_auth_time_count{error=""}[1m])) by (provider)) OR on() vector(0)) == bool 0
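The IAM rules compare the short-term (5m) and middle-term (30m) p95 latencies and apply different thresholds to JWT and non-JWT providers. The Python sketch below illustrates this. The function name and status strings are assumptions, and latencies are taken in milliseconds because the expressions divide by 1000 before comparing.

```python
def iam_health(auth_rate: float,
               jwt_p95_5m_ms: float, jwt_p95_30m_ms: float,
               other_p95_5m_ms: float, other_p95_30m_ms: float) -> str:
    """Classify IAM health from the auth rate and the latency drift per provider type."""
    if auth_rate == 0:
        return "KO"  # no authentication operations reported
    jwt_drift_s = abs(jwt_p95_5m_ms - jwt_p95_30m_ms) / 1000
    other_drift_s = abs(other_p95_5m_ms - other_p95_30m_ms) / 1000
    if jwt_drift_s > 5 or other_drift_s > 30:
        return "KO"
    if jwt_drift_s > 1 or other_drift_s > 5:
        return "DEGRADED"
    return "OK"


print(iam_health(2.0, jwt_p95_5m_ms=1500, jwt_p95_30m_ms=200,
                 other_p95_5m_ms=300, other_p95_30m_ms=300))
```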

OAP Health

OAP Health Status.

If OAP's JVM metrics are not reported, then OAP in the management plane has an issue.
If the number of clusters reporting to XCP Central is less than the number of control plane OAPs, there's an issue with one or more OAPs in the control planes. The dependency on XCP Central health is controlled by only accounting for positive differences.

Metric Name	Labels	PromQL Expression
`jvm_threads_current`	`component` `plane`	(sum(jvm_threads_current{component="oap", plane="management"}) OR on() vector(0)) == bool 0
`jvm_threads_current`	`component` `plane`	count(rate(xcp_central_config_propagation_event_count_total{status="received",type="config_resync_request"}[5m]) OR on() vector(0)) - count(jvm_threads_current{component="oap", plane="control"} OR on() vector(0)) > bool 0
`xcp_central_config_propagation_event_count_total`	`status` `type`	count(rate(xcp_central_config_propagation_event_count_total{status="received",type="config_resync_request"}[5m]) OR on() vector(0)) - count(jvm_threads_current{component="oap", plane="control"} OR on() vector(0)) > bool 0

Front Envoy Health

Front Envoy Health Status.

The status depends on whether the difference between the short-term and long-term response times of its upstream transactions exceeds a given threshold in ms (defined by the divisor).

Metric Name	Labels	PromQL Expression
`envoy_cluster_internal_upstream_rq_time_bucket`	`component`	histogram_quantile(0.95, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[5m])) by (le)) OR on() vector(2000)
`envoy_cluster_internal_upstream_rq_time_bucket`	`component`	histogram_quantile(0.95, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[60m])) by (le)) OR on() vector(1000)

TSB Operational Status

Operational metrics to indicate Tetrate Service Bridge API server health.

Front Envoy Success Rate

Rate of successful requests to Front Envoy. This includes all user and cluster requests into the management plane.

Note: This indicates the health of the AuthZ server, not whether the user or cluster making the request has the correct permissions.

Metric Name	Labels	PromQL Expression
`envoy_cluster_internal_upstream_rq_total`	`component` `envoy_response_code`	sum(rate(envoy_cluster_internal_upstream_rq_total{envoy_response_code=~"2.|3.|401", component="front-envoy"}[1m])) by (envoy_cluster_name)

Front Envoy Error Rate

The error rate of requests to the Front Envoy server. This includes all user and cluster requests into the management plane.

Note: This indicates the health of the AuthZ server, not whether the user or cluster making the request has the correct permissions.

Metric Name	Labels	PromQL Expression
`envoy_cluster_internal_upstream_rq_total`	`component` `envoy_response_code`	sum(rate(envoy_cluster_internal_upstream_rq_total{envoy_response_code!~"2.|3.|401", component="front-envoy"}[1m])) by (envoy_cluster_name, envoy_response_code)

Front Envoy Latency

Front Envoy request latency percentiles.

Metric Name	Labels	PromQL Expression
`envoy_cluster_internal_upstream_rq_time_bucket`	`component`	histogram_quantile(0.99, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[1m])) by (le, envoy_cluster_name))
`envoy_cluster_internal_upstream_rq_time_bucket`	`component`	histogram_quantile(0.95, sum(rate(envoy_cluster_internal_upstream_rq_time_bucket{component="front-envoy"}[1m])) by (le, envoy_cluster_name))
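The latency percentiles on this page come from `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below is a simplified illustration of that idea, not the Prometheus implementation. It assumes positive upper bounds, a final `+Inf` bucket, and invented bucket values.

```python
from typing import List, Tuple

INF = float("inf")


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q < 1:
        raise ValueError("q must be between 0 and 1 (exclusive)")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != INF:
        raise ValueError("need at least one finite bucket and a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == INF:
                # Rank falls in the overflow bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return float("nan")


# Hypothetical cumulative buckets in milliseconds: 60 requests <= 50 ms, 90 <= 100 ms, ...
sample = [(50.0, 60.0), (100.0, 90.0), (250.0, 98.0), (INF, 100.0)]
print(f"p95 = {histogram_quantile(0.95, sample):.2f} ms")
```

The estimate can never be more precise than the bucket boundaries allow, which is why a percentile that lands in the `+Inf` bucket is reported as the highest finite bound.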

TSB Success Rate

Rate of successful requests to the TSB apiserver from the UI and CLI.

Metric Name	Labels	PromQL Expression
`grpc_server_handled_total`	`component` `grpc_code` `grpc_method` `grpc_type`	sum(rate(grpc_server_handled_total{component="tsb", grpc_code="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_method)

TSB Error Rate

Rate of failed requests to the TSB apiserver from the UI and CLI.

Metric Name	Labels	PromQL Expression
`grpc_server_handled_total`	`component` `grpc_code` `grpc_method` `grpc_type`	sum(rate(grpc_server_handled_total{component="tsb", grpc_code!="OK", grpc_type="unary", grpc_method!="SendAuditLog"}[1m])) by (grpc_code, grpc_method)

Authentication Success Rate

The success rate for authentication operations for each type of authentication provider.

Metric Name	Labels	PromQL Expression
`iam_auth_time_count`	`error`	sum(rate(iam_auth_time_count{error=""}[1m])) by (provider)

Authentication Error Rate

The error rate for authentication operations for each type of authentication provider.

Spikes may indicate problems with the provider or the given credentials, such as expired JWT tokens.

Metric Name	Labels	PromQL Expression
`iam_auth_time_count`	`error`	sum(rate(iam_auth_time_count{error!=""}[1m])) by (provider)

Authentication Latency

The latency for authentication operations for each type of authentication provider.

Spikes in the latency may indicate that the authentication provider has a sub-optimal configuration (such as too wide LDAP queries).

Metric Name	Labels	PromQL Expression
`iam_auth_time_bucket`	`error`	histogram_quantile(0.99, sum(rate(iam_auth_time_bucket{error=""}[1m])) by (le, provider))
`iam_auth_time_bucket`	`error`	histogram_quantile(0.95, sum(rate(iam_auth_time_bucket{error=""}[1m])) by (le, provider))

Data Store Operations Rate

Request rate for operations persisting data to the datastore grouped by method and kind.

Metric Name	Labels	PromQL Expression
`persistence_operation_total`	N/A	sum(rate(persistence_operation_total[1m])) by (kind, method)

Data Store Operations Error Rate

The request error rate for operations persisting data to the datastore grouped by method and kind. This graph also includes transactions. These are standard SQL transactions and consist of multiple operations.

Note: The graph explicitly excludes "resource not found" errors. A small number of "not found" responses is normal because TSB, as an optimization, often uses Get queries instead of Exists to determine whether a resource exists.

Metric Name	Labels	PromQL Expression
`persistence_operation_total`	`error` `kind`	sum(rate(persistence_operation_total{error="true", kind!="iam_revoked_token"}[1m])) by (kind, method, error)

Data Store Operations Latency

The request latency for operations persisting data to the datastore grouped by method.

Metric Name	Labels	PromQL Expression
`persistence_operation_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(persistence_operation_duration_bucket[1m])) by (le, method))

Data Store Transaction Rate

The rate of newly created transactions by dry run mode. These are standard SQL transactions and consist of multiple operations.

Metric Name	Labels	PromQL Expression
`persistence_transaction_total`	N/A	sum(rate(persistence_transaction_total[1m])) by (dry_run_mode)

Data Store Transaction Error Rate

The rate of transactions that failed at execution time by dry run mode. These are standard SQL transactions and consist of multiple operations.

Metric Name	Labels	PromQL Expression
`persistence_transaction_total`	`error`	sum(rate(persistence_transaction_total{error="true"}[1m])) by (dry_run_mode)

Data Store Transactions Latency

The P99 latency of transaction execution grouped by dry run mode.

Metric Name	Labels	PromQL Expression
`persistence_transaction_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(persistence_transaction_duration_bucket[1m])) by (le, dry_run_mode))

Active Transactions

The number of running transactions on the datastore.

This graph shows how many active transactions are running at a given point in time. It helps you understand the load of the system generated by concurrent access to the platform.

Metric Name	Labels	PromQL Expression
`persistence_concurrent_transaction`	N/A	sum(persistence_concurrent_transaction) by (dry_run_mode)

Mismatching Transactions dry run mode

This is the number of child transactions that do not match their parent transaction's dry run mode.

This means that the dry run mode has not properly been propagated to child transactions.

Metric Name	Labels	PromQL Expression
`persistence_dry_run_transaction_mismatch_total`	N/A	sum(persistence_dry_run_transaction_mismatch_total)

In Use Connections

The number of connections currently in use.

Metric Name	Labels	PromQL Expression
`db_client_connections_usage`	`component` `db_system_name` `state`	sum(max_over_time(db_client_connections_usage{db_system_name="postgresql", state="used", component="n2ac"}[1m]))
`go_sql_in_use_connections`	`db_name`	sum(max_over_time(go_sql_in_use_connections{db_name="dbpool"}[1m])) by(component)

Open Connections/Max Connections

The number of established connections, both in use and idle. The maximum allowed number of connections is also displayed.

Metric Name	Labels	PromQL Expression
`db_client_connection_max`	`component` `db_system_name`	sum(max_over_time(db_client_connection_max{db_system_name="postgresql", component="n2ac"}[1m]))
`db_client_connections_total`	`component` `db_system_name`	sum(max_over_time(db_client_connections_total{db_system_name="postgresql", component="n2ac"}[1m]))
`go_sql_max_open_connections`	`db_name`	sum(max_over_time(go_sql_max_open_connections{db_name="dbpool"}[1m])) by(component)
`go_sql_open_connections`	`db_name`	sum(max_over_time(go_sql_open_connections{db_name="dbpool"}[1m])) by(component)

Connections Waited

The total number of connections waited for.

Metric Name	Labels	PromQL Expression
`go_sql_wait_count_total`	`db_name`	sum(increase(go_sql_wait_count_total{db_name="dbpool"}[1m])) by(component)
`pgx_pool_wait_for_acquire_total`	`component` `db_system_name`	sum(increase(pgx_pool_wait_for_acquire_total{db_system_name="postgresql", component="n2ac"}[1m]))

Time Waiting for Connections

The total time blocked waiting for a new connection.

Metric Name	Labels	PromQL Expression
`go_sql_wait_duration_seconds_total`	`component` `db_name`	sum(increase(go_sql_wait_duration_seconds_total{component="tsb", db_name="dbpool"}[1m])) by(component)

Created Connections

The number of created connections.

Metric Name	Labels	PromQL Expression
`db_sql_created_connections_duration_count`	N/A	sum(increase(db_sql_created_connections_duration_count[1m])) by (component)
`pgx_pool_connections_created_total`	`component` `db_system_name`	sum(increase(pgx_pool_connections_created_total{db_system_name="postgresql", component="n2ac"}[1m]))

Create Connections Latency

The p99 duration in milliseconds for creating a new connection.

Metric Name	Labels	PromQL Expression
`db_sql_created_connections_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(db_sql_created_connections_duration_bucket[1m])) by (le, component))

Closed Connections

The number of closed connections by reason.

Max Idle: The total number of connections closed due to SetMaxIdleConns.
Idle Time: The total number of connections closed due to SetConnMaxIdleTime.
Max Lifetime: The total number of connections closed due to SetConnMaxLifetime.

Metric Name	Labels	PromQL Expression
`go_sql_max_idle_closed_total`	`db_name`	sum(increase(go_sql_max_idle_closed_total{db_name="dbpool"}[1m])) by(component)
`go_sql_max_idle_time_closed_total`	`db_name`	sum(increase(go_sql_max_idle_time_closed_total{db_name="dbpool"}[1m])) by(component)
`go_sql_max_lifetime_closed_total`	`db_name`	sum(increase(go_sql_max_lifetime_closed_total{db_name="dbpool"}[1m])) by(component)
`pgx_pool_connections_destroyed_total`	`component` `db_system_name` `reason`	sum(increase(pgx_pool_connections_destroyed_total{db_system_name="postgresql", reason="idletime", component="n2ac"}[1m]))
`pgx_pool_connections_destroyed_total`	`component` `db_system_name` `reason`	sum(increase(pgx_pool_connections_destroyed_total{db_system_name="postgresql", reason="lifetime", component="n2ac"}[1m]))

Idle Connections

The number of idle connections.

Metric Name	Labels	PromQL Expression
`db_client_connections_usage`	`component` `db_system_name` `state`	sum(max_over_time(db_client_connections_usage{db_system_name="postgresql", state="idle", component="n2ac"}[1m]))
`go_sql_idle_connections`	`db_name`	sum(max_over_time(go_sql_idle_connections{db_name="dbpool"}[1m])) by(component)

Service Registry Operations

This metric shows the number of operations performed by the service registry. The service registry handles all service changes across the clusters, detecting them and persisting them in the database.

Metric Name	Labels	PromQL Expression
`service_registry_operation_duration_count`	`error`	sum(increase(service_registry_operation_duration_count{error=""}[1m])) by (operation)
`service_registry_operation_duration_count`	`error`	sum(increase(service_registry_operation_duration_count{error!=""}[1m])) by (operation)

Service Registry Operations Duration

Duration of operations performed by the service registry.

This graph also includes the total duration of the reconciliation process during which the service registry iterates through all clusters to identify changes that need to be persisted in the database.

Metric Name	Labels	PromQL Expression
`service_registry_operation_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(service_registry_operation_duration_bucket[1m])) by (le, operation))
`service_registry_total_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(service_registry_total_duration_bucket[1m])) by (le))

PDP Success Rate

Successful request rate of PDP grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. The other components of NGAC use this graph to perform access decisions.
Policy Decision Point (PDP): Performs access decisions based on the NGAC graph's policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent "access granted" decisions; they represent the access decision requests for which a verdict was obtained.

A drop in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads and this is usually a consequence of failures in the PIP. Failures in the PIP for write operations will result in the graph not being properly updated to the latest status, resulting in access decisions based on stale models.

Metric Name	Labels	PromQL Expression
`ngac_pdp_operation_total`	`error`	sum(rate(ngac_pdp_operation_total{error=""}[1m])) by (method)

PDP Error Rate

Rate of errors for PDP requests grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. The other components of NGAC use this graph to perform access decisions.
Policy Decision Point (PDP): Performs access decisions based on the NGAC graph's policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent "access granted" decisions; they represent the access decision requests where a verdict was obtained.

Failed requests to the PDP show the number of requests from the PEP to the PDP that have failed. They do not represent "access denied" decisions; they represent the access decision requests where a verdict could not be obtained.

A rise in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads and this is usually a consequence of failures in the PIP. Failures in the PIP for write operations will result in the graph not being correctly updated to the latest status, resulting in access decisions based on stale models.

Metric Name	Labels	PromQL Expression
`ngac_pdp_operation_total`	`error`	sum(rate(ngac_pdp_operation_total{error!=""}[1m])) by (method)
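The distinction between a denied decision and a failed request can be illustrated with a small Python sketch. It counts hypothetical PDP operations by method and error value, mirroring the `error=""` and `error!=""` selectors used in the success and error rate expressions. The error text and the `record` helper are assumptions for illustration only.

```python
from collections import Counter


def record(counter: Counter, method: str, error: str = "") -> None:
    """Count one PDP operation, keyed by method and error (empty means a verdict was obtained)."""
    counter[(method, error)] += 1


ops = Counter()
record(ops, "Check")                            # verdict obtained: access allowed
record(ops, "Check")                            # verdict obtained: access denied, still a success
record(ops, "List", error="graph unavailable")  # no verdict obtained: a failed request

successes = sum(n for (_, err), n in ops.items() if err == "")
failures = sum(n for (_, err), n in ops.items() if err != "")
print(f"successes={successes} failures={failures}")
```

A denied Check lands in the success count because a verdict was produced; only requests that could not be evaluated contribute to the error rate.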

PDP Latency

PDP latency percentiles grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. The other components of NGAC use this graph to perform access decisions.
Policy Decision Point (PDP): Performs access decisions based on the NGAC graph's policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. A successful request does not represent "access granted" decisions; they represent the access decision requests for which a verdict was obtained.

This metric shows the time it takes to get an access decision for authorization requests.

Degradation in PDP operations may result in general degradation of the system. PDP latency represents the time it takes to make access decisions, and that will impact user experience since access decisions are made and enforced for every operation.

Metric Name	Labels	PromQL Expression
`ngac_pdp_operation_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(ngac_pdp_operation_duration_bucket[1m])) by (method, le))
`ngac_pdp_operation_duration_bucket`	N/A	histogram_quantile(0.95, sum(rate(ngac_pdp_operation_duration_bucket[1m])) by (method, le))

PIP Success Rate

Successful request rate of PIP grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

PIP operations are executed against the NGAC graph to represent and maintain the objects in the system and the relationships between them.

A drop in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, or that it is failing to persist data. Read failures may result in failed access decisions (in the PDP) and user interaction with the system may be rejected as well. Failures in write operations will result in the graph not being properly updated to the latest status and that could result in access decisions based on stale models.

Metric Name	Labels	PromQL Expression
`ngac_pip_operation_total`	`error`	sum(rate(ngac_pip_operation_total{error=""}[1m])) by (method)

PIP Latency

PIP latency percentiles grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

This metric shows the time it takes for a PIP operation to complete and, in the case of write operations, to have data persisted in the NGAC graph.

Degradation in PIP operations may result in general degradation of the system. PIP latency represents the time it takes to access the NGAC graph, and this directly affects the PDP when running access decisions. A degraded PIP may result in a degraded PDP, and that will impact user experience, as access decisions are made and enforced for every operation.

Metric Name	Labels	PromQL Expression
`ngac_pip_operation_duration_bucket`	N/A	histogram_quantile(0.99, sum(rate(ngac_pip_operation_duration_bucket[1m])) by (method, le))
`ngac_pip_operation_duration_bucket`	N/A	histogram_quantile(0.95, sum(rate(ngac_pip_operation_duration_bucket[1m])) by (method, le))

PIP Error Rate

Rate of errors for PIP requests grouped by method.

NGAC is a graph-based authorization framework that consists of three main components:

Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. This graph is used by the other components of NGAC to perform access decisions.
Policy Decision Point (PDP): Performs access decisions based on the policies configured in the NGAC graph. The PDP is used to perform binary Allow/Deny access decisions (Check), and to determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they just represent the access decision requests for which a verdict was obtained.

PIP operations are executed against the NGAC graph to represent and maintain the objects in the system and the relationships between them.

Note: "Node not found" errors are explicitly excluded because TSB often uses the GetNode method instead of Exists to determine whether a node exists, as an optimisation.

A general rise in this metric may show that operations against the graph are failing. This may mean that the graph is unavailable for reads, or that it is failing to persist data. Read failures may result in failed access decisions (in the PDP) and user interaction with the system may be rejected as well. Failures in write operations will result in the graph not being properly updated to the latest status and that could result in access decisions based on stale models.

Metric Name	Labels	PromQL Expression
`ngac_pip_operation_total`	`error`	sum(rate(ngac_pip_operation_total{error!="", error!="Node not found"}[1m])) by (method)
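
The label selector in the expression above keeps only series whose `error` label is non-empty and is not "Node not found". A minimal sketch of that filter follows, assuming each series is represented as a plain dictionary of labels. The method name `CreateNode` and the error text `storage unavailable` are hypothetical values used only for illustration.

```python
from typing import Dict, List

Labels = Dict[str, str]


def is_counted_error(labels: Labels) -> bool:
    """Mirror the selector {error!="", error!="Node not found"}.

    A missing label is treated like an empty one, as PromQL matchers do.
    """
    error = labels.get("error", "")
    return error != "" and error != "Node not found"


# Hypothetical label sets for three series of ngac_pip_operation_total.
samples: List[Labels] = [
    {"method": "GetNode", "error": ""},                        # success
    {"method": "GetNode", "error": "Node not found"},          # excluded on purpose
    {"method": "CreateNode", "error": "storage unavailable"},  # counted as an error
]
counted = [s for s in samples if is_counted_error(s)]
```

Only the third series would contribute to the error rate shown in this panel.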

Active PIP Transactions

The number of running transactions on the NGAC PIP.

NGAC is a graph-based authorization framework that consists of three main components:

Policy Information Point (PIP): Maintains the NGAC graph. It creates the nodes and edges in the graph that represents the state of the system. The other components of NGAC use this graph to perform access decisions.
Policy Decision Point (PDP): Performs access decisions based on the NGAC graph's policies. The PDP is used to perform binary Allow/Deny access decisions (Check) and determine the objects a user has access to (List). These access decisions are enforced at the Policy Enforcement Point (PEP).
Policy Enforcement Point (PEP): Enforces access control by calling the PDP to get an access decision. Successful requests to the PDP show the number of requests that the PEP has successfully made to the PDP. Successful requests do not represent "access granted" decisions; they represent the access decision requests for which a verdict was obtained.

This metric shows the number of active write operations against the NGAC graph. It can be useful to understand the load of the system generated by concurrent access to the platform.

Metric Name	Labels	PromQL Expression
`ngac_pip_concurrent_transaction`	N/A	sum(ngac_pip_concurrent_transaction)

TSB webhooks

Operational status of the Tetrate Service Bridge webhooks

Management Plane webhook requests

Shows the rate of requests for the webhooks in each management plane component

Metric Name	Labels	PromQL Expression
`controller_runtime_webhook_requests_total`	`plane`	sum by(component) (rate(controller_runtime_webhook_requests_total{plane="management"}[1m]))

Control Plane webhooks requests

Shows the rate of requests for the webhooks in each control plane component for the selected cluster

Metric Name	Labels	PromQL Expression
`controller_runtime_webhook_requests_total`	`cluster_name` `plane`	sum by(component) (rate(controller_runtime_webhook_requests_total{plane="control", cluster_name="$cluster"}[1m]))

Management Plane webhook latency

Shows the latency percentiles across all management plane webhooks

Metric Name	Labels	PromQL Expression
`controller_runtime_webhook_latency_seconds_bucket`	`plane`	histogram_quantile(0.5, sum by(le) (rate(controller_runtime_webhook_latency_seconds_bucket{plane="management"}[1m])))
`controller_runtime_webhook_latency_seconds_bucket`	`plane`	histogram_quantile(0.95, sum by(le) (rate(controller_runtime_webhook_latency_seconds_bucket{plane="management"}[1m])))
`controller_runtime_webhook_latency_seconds_bucket`	`plane`	histogram_quantile(0.99, sum by(le) (rate(controller_runtime_webhook_latency_seconds_bucket{plane="management"}[1m])))

Control Plane webhook latency

Shows the latency percentiles across all control plane webhooks for the selected cluster

Metric Name	Labels	PromQL Expression
`controller_runtime_webhook_latency_seconds_bucket`	`cluster_name` `plane`	histogram_quantile(0.5, sum by(le) (rate(controller_runtime_webhook_latency_seconds_bucket{plane="control", cluster_name="$cluster"}[1m])))
`controller_runtime_webhook_latency_seconds_bucket`	`cluster_name` `plane`	histogram_quantile(0.95, sum by(le) (rate(controller_runtime_webhook_latency_seconds_bucket{plane="control", cluster_name="$cluster"}[1m])))
`controller_runtime_webhook_latency_seconds_bucket`	`cluster_name` `plane`	histogram_quantile(0.99, sum by(le) (rate(controller_runtime_webhook_latency_seconds_bucket{plane="control", cluster_name="$cluster"}[1m])))

Management Plane deletion protection webhook

Shows the number of deletion requests denied and invalid requests received by the deletion protection webhook in the management plane

Metric Name	Labels	PromQL Expression
`deletion_protection_webhook_denied_total`	`plane`	sum by(component) (rate(deletion_protection_webhook_denied_total{plane="management"}[1m]))
`deletion_protection_webhook_invalid_total`	`plane`	sum by(component) (rate(deletion_protection_webhook_invalid_total{plane="management"}[1m]))

Control Plane deletion protection webhooks

Shows the number of deletion requests denied and invalid requests received by the deletion protection webhook in the control plane for the selected cluster

Metric Name	Labels	PromQL Expression
`deletion_protection_webhook_denied_total`	`cluster_name` `plane`	sum by(component) (rate(deletion_protection_webhook_denied_total{plane="control", cluster_name="$cluster"}[1m]))
`deletion_protection_webhook_invalid_total`	`cluster_name` `plane`	sum by(component) (rate(deletion_protection_webhook_invalid_total{plane="control", cluster_name="$cluster"}[1m]))

XCP Central Operational Status

Operational metrics to indicate XCP Central health.

Metric Name	Labels	PromQL Expression
`process_start_time_seconds`	`component` `plane`	time() - process_start_time_seconds{component="xcp",plane="management"}

XCP Central Version

Metric Name	Labels	PromQL Expression
`xcp_central_version`	N/A	label_replace(xcp_central_version, "xcp_version", "$1", "version", "(.*)")

Time since last cluster state received from the edge (seconds)

Since the default cluster state resync time is 10 minutes, any value higher than 600-700 seconds is considered abnormal.

Metric Name	Labels	PromQL Expression
`xcp_central_current_onboarded_edge_total`	N/A	time() - max(max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge_total[2m]) == 0)) by (edge) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received" , type="cluster_state"} /1000) by (edge,type)
`xcp_central_last_config_propagation_event_timestamp_ms`	`edge` `status` `type`	time() - max(max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge_total[2m]) == 0)) by (edge) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received" , type="cluster_state"} /1000) by (edge,type)
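
At its core, the expression converts a millisecond timestamp to seconds and subtracts it from the current time. The sketch below shows that arithmetic and flags edges whose last cluster state is older than a threshold. The 700 second default follows the guidance above; the edge names and timestamps are hypothetical, and the helper functions are illustrative only.

```python
import time
from typing import Dict, Optional


def seconds_since(timestamp_ms: float, now: Optional[float] = None) -> float:
    """Seconds elapsed since a millisecond timestamp (time() - ts / 1000)."""
    current = time.time() if now is None else now
    return current - timestamp_ms / 1000.0


def stale_edges(
    last_received_ms: Dict[str, float],
    threshold_s: float = 700.0,
    now: Optional[float] = None,
) -> Dict[str, float]:
    """Return edges whose last cluster state is older than threshold_s."""
    ages = {edge: seconds_since(ts, now) for edge, ts in last_received_ms.items()}
    return {edge: age for edge, age in ages.items() if age > threshold_s}


# Hypothetical data: a fixed "now" and two edges with different last-seen times.
fixed_now = 1_700_000_000.0
last_seen_ms = {
    "edge-a": (fixed_now - 120) * 1000,  # 2 minutes ago, within the resync period
    "edge-b": (fixed_now - 900) * 1000,  # 15 minutes ago, abnormal
}
abnormal = stale_edges(last_seen_ms, now=fixed_now)
```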

Time since cluster states were sent to the MPC and Edges clients (seconds)

Metric Name	Labels	PromQL Expression
`xcp_central_current_onboarded_edge`	N/A	time() - max((xcp_central_last_cluster_state_event_timestamp_ms / 1000  unless on(peer_cluster_name) label_replace(increase(xcp_central_current_onboarded_edge[5m]),"peer_cluster_name", "$1", "edge", "(.)") == 0) unless on(cluster_state_event_cluster_name) label_replace(increase(xcp_central_current_onboarded_edge[5m]),"cluster_state_event_cluster_name", "$1", "edge", "(.)") == 0) by (peer_cluster_name, cluster_state_event_cluster_name)
`xcp_central_last_cluster_state_event_timestamp_ms`	N/A	time() - max((xcp_central_last_cluster_state_event_timestamp_ms / 1000  unless on(peer_cluster_name) label_replace(increase(xcp_central_current_onboarded_edge[5m]),"peer_cluster_name", "$1", "edge", "(.)") == 0) unless on(cluster_state_event_cluster_name) label_replace(increase(xcp_central_current_onboarded_edge[5m]),"cluster_state_event_cluster_name", "$1", "edge", "(.)") == 0) by (peer_cluster_name, cluster_state_event_cluster_name)

Time since config resync request is received from the edge (seconds)

Edges send resync requests periodically, so a value higher than the resync period (60 seconds by default) is not normal.

Metric Name	Labels	PromQL Expression
`xcp_central_current_onboarded_edge_total`	N/A	time() - max(max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge_total[2m]) == 0)) by (edge) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received" , type="config_resync_request"} /1000) by (edge,type)
`xcp_central_last_config_propagation_event_timestamp_ms`	`edge` `status` `type`	time() - max(max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge_total[2m]) == 0)) by (edge) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="received" , type="config_resync_request"} /1000) by (edge,type)

Time since config CRs sent to the edge (seconds)

Sent: Time since configs such as workspaces, traffic groups, etc. were sent to the edge. In steady state, a very high value is fine.

Metric Name	Labels	PromQL Expression
`xcp_central_current_onboarded_edge_total`	N/A	time() - max(max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge_total[2m]) == 0)) by (edge) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="sent"} /1000) by (edge,type)
`xcp_central_last_config_propagation_event_timestamp_ms`	`edge` `status`	time() - max(max((increase(xcp_central_current_onboarded_edge_total[2m]) unless increase(xcp_central_current_onboarded_edge_total[2m]) == 0)) by (edge) + on(edge) group_right xcp_central_last_config_propagation_event_timestamp_ms{edge!="", status="sent"} /1000) by (edge,type)

messages received by central from edges in last 5 min

Number of times any message is received by central from edges

Messages received by central from any edge are of three types:

Periodic (per minute by default) config resync request
Cluster state
Header message to ack the config received

This number is the combined count of all three in the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_central_config_propagation_event_count_total`	`status` `type`	increase(xcp_central_config_propagation_event_count_total{status="received",type="config_resync_request"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge_total[5m]) == 0
`xcp_central_config_propagation_event_count_total`	`status` `type`	increase(xcp_central_config_propagation_event_count_total{status="received",type="cluster_state"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge_total[5m]) == 0
`xcp_central_current_onboarded_edge_total`	N/A	increase(xcp_central_config_propagation_event_count_total{status="received",type="config_resync_request"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge_total[5m]) == 0
`xcp_central_current_onboarded_edge_total`	N/A	increase(xcp_central_config_propagation_event_count_total{status="received",type="cluster_state"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge_total[5m]) == 0

Number of times config CRs sent by central to the edges in last 5m

Number of times config CRs such as workspaces, traffic groups, etc. were sent by central in the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_central_config_propagation_event_count_total`	`status`	increase(xcp_central_config_propagation_event_count_total{status="sent"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge_total[5m]) == 0
`xcp_central_current_onboarded_edge_total`	N/A	increase(xcp_central_config_propagation_event_count_total{status="sent"}[5m]) unless on(edge) increase(xcp_central_current_onboarded_edge_total[5m]) == 0

Config Propagation Latency by Edge

Distribution of time to propagate updates from Central (Management plane) to Edges. If there is no config push in the last minute, you will see all 0s, which is expected.

Metric Name	Labels	PromQL Expression
`xcp_central_config_propagation_time_ms_bucket`	N/A	histogram_quantile(0.99, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_config_propagation_time_ms_bucket`	N/A	histogram_quantile(0.95, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_config_propagation_time_ms_bucket`	N/A	histogram_quantile(0.90, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_config_propagation_time_ms_bucket`	N/A	histogram_quantile(0.75, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_config_propagation_time_ms_bucket`	N/A	histogram_quantile(0.50, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_current_onboarded_edge`	N/A	histogram_quantile(0.99, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_current_onboarded_edge`	N/A	histogram_quantile(0.95, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_current_onboarded_edge`	N/A	histogram_quantile(0.90, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_current_onboarded_edge`	N/A	histogram_quantile(0.75, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0
`xcp_central_current_onboarded_edge`	N/A	histogram_quantile(0.50, sum(rate(xcp_central_config_propagation_time_ms_bucket[1m])) by (le, edge)) unless on(edge) increase(xcp_central_current_onboarded_edge[5m]) == 0

Errors in config push REQUESTS to the edges in last 5 minutes

Central enqueues the config push request to the debouncer (an internal component of central) when:

It receives an event about config resources from the k8s apiserver, or
Any edge connects for the first time, or
It is handling a periodic resync request from any of the edges.

In any of these cases, if central encounters an error in the event handling before enqueuing the config push request to the debouncer, this metric is incremented. So this panel is inversely related to "config push(to the edges) requests enqueued to debouncer in last 5 min".

Metric Name	Labels	PromQL Expression
`xcp_central_config_update_error_count`	N/A	increase(xcp_central_config_update_error_count[5m]) OR on() vector(0)

config push(to the edges) requests enqueued to debouncer in last 5 min

Number of times central enqueued a config push (to the connected edges) request to the debouncer in the last 5 minutes. Along with the count of requests, the reason for each config push request is also shown.

Note: This metric does not indicate the count of actual config pushes by central. Because of debouncing, the number of actual config pushes will generally be lower than this metric. In other words, this metric shows input events for config push. The output (config pushes on grpc channels to edges) will be lower because of the debouncer.

Reasons could be:

ADD/DELETE/UPDATE: These are the events received by central from the k8s apiserver. Example: ADD/IngressGateway means the count of config push requests enqueued because of new IngressGateway CRs created at the k8s apiserver.
EDGE_RESYNC: The count of config push requests triggered by a periodic config resync request from an edge. This will be non-zero only in rare cases when, for whatever reason, an edge reported a stale set of configs and central triggers a config push to refresh them.
EDGE_FIRST_CONNECTION: When any edge connects to central, central syncs config to the edge. In steady state, its count must be 0. A non-zero count indicates that the grpc stream between central and the edge is in error and is being reconnected.
CENTRAL_RESYNC: Central enqueues a config push request every 5 minutes to reconcile configs at edges. Note that this results in an actual config push only to those edges which are not actively sending their config version periodically. Since 1.4, edges request config resync themselves, so central will actually push configs over grpc as a result of these requests only if the edge is older than 1.4.

Metric Name	Labels	PromQL Expression
`xcp_central_config_update_push_count_total`	N/A	increase(xcp_central_config_update_push_count_total[5m])
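
To illustrate why the number of enqueued requests is generally higher than the number of actual pushes, here is a minimal sketch of a generic debouncer. It is not the XCP Central implementation, whose algorithm is not described here. It assumes a simple "quiet period" model in which a push fires only once no new request has arrived for `quiet_period` seconds; the event times and the period are hypothetical.

```python
from typing import List


def debounce(event_times: List[float], quiet_period: float) -> List[float]:
    """Return the times at which a push would fire.

    A push fires once no new event has arrived for quiet_period seconds,
    so a burst of events collapses into a single push.
    """
    pushes: List[float] = []
    if not event_times:
        return pushes
    events = sorted(event_times)
    last = events[0]
    for t in events[1:]:
        if t - last >= quiet_period:
            # The previous burst went quiet before this event arrived.
            pushes.append(last + quiet_period)
        last = t
    # The final burst fires after its own quiet period.
    pushes.append(last + quiet_period)
    return pushes


# Hypothetical input: a burst of three requests, then a burst of two.
enqueued = [0.0, 0.2, 0.4, 5.0, 5.1]
pushed = debounce(enqueued, quiet_period=1.0)
```

In this example five enqueued requests result in two pushes, which mirrors the relationship between this panel's input count and the pushes sent over the grpc channels.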

Pending configurations (orphan configs)

Pending configurations are configs for which the cluster could not be determined yet because the parent resource is missing. These metrics show which configurations are currently in Pending state, and the missing Parent group configuration that causes the Pending state.

More information on Pending configurations can be found using the XCP central debug endpoint - /debug/cluster_scoped_configs/?pending=true

Metric Name	Labels	PromQL Expression
`xcp_central_pending_configs_total`	N/A	xcp_central_pending_configs_total
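
As a small convenience, the debug endpoint URL mentioned above can be assembled as follows. This is only a sketch: the path and query parameter come from this page, while the host, port, and scheme are hypothetical and depend on how XCP Central's debug interface is exposed in your environment.

```python
from urllib.parse import urlencode, urlunsplit


def pending_configs_url(host: str, scheme: str = "http") -> str:
    """Build the URL of the pending-configs debug endpoint for a given host."""
    query = urlencode({"pending": "true"})
    return urlunsplit((scheme, host, "/debug/cluster_scoped_configs/", query, ""))


# Hypothetical host and port, for example reached through a local port-forward.
url = pending_configs_url("localhost:8080")
```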

Number of connections(cluster state pushing and config pushing)

Central has two types of grpc connections:

edge_config_distribution: One grpc connection with each edge for pushing user configs like workspace, trafficgroup, etc.
cluster_state: One grpc connection with each edge for pushing learned cluster state (service discovery) from all other peer edges. In addition, one more grpc connection with the mpc for pushing all the learned cluster states to the tsb server.

The count of edge_config_distribution connections will be equal to the number of edges connected to central. The count of cluster_state connections will be one more than the count of edge_config_distribution connections because of the additional mpc connection.

IMPORTANT NOTE: If a cluster is not onboarded (TSB cluster object missing) but its edge is up and connected to central, the connection counts will include that edge.

Metric Name	Labels	PromQL Expression
`xcp_central_current_edge_connections`	`connection_type`	xcp_central_current_edge_connections{connection_type="edge_config_distribution"} OR on() vector(0)
`xcp_central_current_edge_connections`	`connection_type`	xcp_central_current_edge_connections{connection_type="cluster_state"} OR on() vector(0)
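
The relationship between the two connection counts can be written as a simple consistency check. The sketch below assumes the two gauge values have already been read from the panel or from Prometheus; the function names are illustrative.

```python
def expected_cluster_state_connections(edge_config_distribution: int) -> int:
    """One cluster_state connection per edge, plus one for the mpc."""
    return edge_config_distribution + 1


def connection_counts_consistent(edge_config_distribution: int, cluster_state: int) -> bool:
    """True when the cluster_state count matches the edges plus the mpc."""
    return cluster_state == expected_cluster_state_connections(edge_config_distribution)


# Hypothetical readings: three connected edges and four cluster_state connections.
consistent = connection_counts_consistent(3, 4)
```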

Pending on reference configurations

Pending on reference configurations are configs referring to a missing configuration in the TSB hierarchy. The configs are propagated to edges with missing reference resolution metadata. Currently, only Security Settings refer to other configurations. These metrics show which configurations are currently in PendingOnRef state, and the missing Parent group configuration that causes the PendingOnRef state.

Metric Name	Labels	PromQL Expression
`xcp_central_pending_on_ref_configs_total`	N/A	xcp_central_pending_on_ref_configs_total

validation webhook passed count in last 5 min

Count of requests that the validation webhook passed in the last 5 minutes, by GVK.

Metric Name	Labels	PromQL Expression
`xcp_central_validation_webhook_passed_count`	N/A	increase(xcp_central_validation_webhook_passed_count[5m]) OR on() vector(0)

New connections per min(cluster state pushing and config pushing)

In steady state, edges should stay connected to central for the cluster state and config streams rather than reconnecting. Therefore, the rate must be 0.

Metric Name	Labels	PromQL Expression
`xcp_central_connection_register_count_total`	`connection_type`	rate(xcp_central_connection_register_count_total{connection_type="cluster_state"}[1m]) * 60
`xcp_central_connection_register_count_total`	`connection_type`	rate(xcp_central_connection_register_count_total{connection_type="edge_config_distribution"}[1m]) * 60
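
The `* 60` factor in these expressions converts the per-second value returned by `rate()` into new connections per minute. The sketch below shows that conversion on two counter readings. It is a simplification: the real `rate()` function also handles counter resets and extrapolates to the window boundaries, which this sketch does not.

```python
def per_minute_rate(counter_start: float, counter_end: float, window_seconds: float) -> float:
    """Average per-second increase over the window, scaled to per minute."""
    per_second = (counter_end - counter_start) / window_seconds
    return per_second * 60


# Hypothetical readings of the counter taken 60 seconds apart.
steady = per_minute_rate(10, 10, 60)       # no new connections registered
reconnecting = per_minute_rate(10, 13, 60)  # three new connections in a minute
```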

Connection Push timeouts[5m]

This panel represents the connection push timeouts in the last 5 minutes in Central, grouped by connection type and peer cluster name.

There are two types of connections: Config Propagation and Cluster State Propagation. If the propagation of data (cluster_state/config_propagation) takes more than the configured timeout duration (2 minutes by default), central drops the push. This metric can highlight whether pushes of certain connection types are failing for certain edges, helping narrow down the problematic component. A push can time out if the previous send to the edge is taking too long.

Metric Name	Labels	PromQL Expression
`xcp_central_connection_push_timeout_count`	N/A	sum by(connection_type, peer_cluster_name) (increase(xcp_central_connection_push_timeout_count[5m]))

All goroutines

Metric Name	Labels	PromQL Expression
`go_goroutines`	`component` `plane`	go_goroutines{component="xcp",plane="management"}

Rate of webhook validation errors

Rate of webhook validation errors by GVK

Metric Name	Labels	PromQL Expression
`xcp_central_validation_webhook_failed_count`	N/A	increase(xcp_central_validation_webhook_failed_count[5m]) OR on() vector(0)
`xcp_central_validation_webhook_http_error_count`	N/A	increase(xcp_central_validation_webhook_http_error_count[5m]) OR on() vector(0)

Central memory consumption

Metric Name	Labels	PromQL Expression
`go_memstats_heap_inuse_bytes`	`component` `plane`	go_memstats_heap_inuse_bytes{component="xcp",plane="management"}
`go_memstats_stack_inuse_bytes`	`component` `plane`	go_memstats_stack_inuse_bytes{component="xcp",plane="management"}

Central specific goroutines

This shows the number of active goroutines in XCP Central that are responsible for config pushes to edges.

Metric Name	Labels	PromQL Expression
`go_goroutines`	`component` `plane`	increase(go_goroutines{component="xcp",plane="management"}[1m])
`xcp_central_go_routine_count_total`	N/A	increase(xcp_central_go_routine_count_total[1m])

Edges' memory consumption

This shows the current memory usage for all Edges

Metric Name	Labels	PromQL Expression
`go_memstats_heap_inuse_bytes`	`component` `plane`	go_memstats_heap_inuse_bytes{component="xcp",plane="control"}

Central CPU consumption

Metric Name	Labels	PromQL Expression
`process_cpu_seconds_total`	`job`	rate(process_cpu_seconds_total{job="central-xcp"}[1m])

All edges' CPU consumption

Metric Name	Labels	PromQL Expression
`process_cpu_seconds_total`	`job`	rate(process_cpu_seconds_total{job="edge-xcp"}[1m])

XCP Central Coordinator running

This panel shows whether the XCP Central Coordinator is running on each of the Central instances.

Metric Name	Labels	PromQL Expression
`xcp_central_ha_coordinator_up`	N/A	avg by(instance) (xcp_central_ha_coordinator_up)

XCP Central Leader

Metric Name	Labels	PromQL Expression
`xcp_central_ha_coordinator_acting_as_leader`	N/A	avg by(instance) (xcp_central_ha_coordinator_acting_as_leader)

XCP Central Followers

Metric Name	Labels	PromQL Expression
`xcp_central_ha_coordinator_acting_as_follower`	N/A	avg by(instance) (xcp_central_ha_coordinator_acting_as_follower)
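
The leader and follower panels can be read together. The sketch below checks per-instance gauge values under the assumption (not stated on this page) that a healthy H/A setup has exactly one instance acting as leader and that no instance reports both roles at once. The instance names and values are hypothetical.

```python
from typing import Dict, List


def instances_with_value(gauge: Dict[str, float], value: float = 1.0) -> List[str]:
    """Names of the instances whose gauge equals the given value."""
    return sorted(name for name, v in gauge.items() if v == value)


def ha_roles_look_consistent(leader: Dict[str, float], follower: Dict[str, float]) -> bool:
    """Assumed rule: one leader, and no instance is both leader and follower."""
    leaders = instances_with_value(leader)
    followers = instances_with_value(follower)
    return len(leaders) == 1 and not set(leaders) & set(followers)


# Hypothetical per-instance values of the two gauges.
leader_gauge = {"central-0": 1.0, "central-1": 0.0, "central-2": 0.0}
follower_gauge = {"central-0": 0.0, "central-1": 1.0, "central-2": 1.0}
healthy = ha_roles_look_consistent(leader_gauge, follower_gauge)
```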

XCP Central Coordinator Leader election loop[5m]

This panel represents how many times the XCP Central Coordinator started the leader election loop in the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_central_ha_coordinator_leader_election_loops_total`	N/A	sum by(instance) (increase(xcp_central_ha_coordinator_leader_election_loops_total[5m]))

XCP Central Primary Relay Streams Total

This panel represents how many streams are open in the XCP Central Primary relay server.

When the leader switches, the instance name will change.

Metric Name	Labels	PromQL Expression
`xcp_central_ha_primary_relay_server_streams_total`	N/A	sum by(instance) (xcp_central_ha_primary_relay_server_streams_total)

Number of currently open relay streams at the H/A Primary (Relay Server)

This panel represents the currently open relay stream at the Primary relay server in the XCP Central instances.

Metric Name	Labels	PromQL Expression
`xcp_central_ha_primary_relay_server_streams_open_count_total`	N/A	sum by(instance) (xcp_central_ha_primary_relay_server_streams_open_count_total)

Relay streams rejected by primary relay server

Total number of relay streams rejected by the H/A Primary (Relay Server) because the current XCP Central instance is not the H/A Leader

Metric Name	Labels	PromQL Expression
`xcp_central_ha_primary_relay_server_streams_rejected_total`	N/A	sum by(instance) (increase(xcp_central_ha_primary_relay_server_streams_rejected_total[5m]))

Number of relay streams discontinued by primary relay server

Total number of relay streams closed forcibly by the H/A Primary (Relay Server) because current XCP Central instance stopped being the H/A Leader

Metric Name	Labels	PromQL Expression
`xcp_central_ha_primary_relay_server_streams_discontinued_total`	N/A	xcp_central_ha_primary_relay_server_streams_discontinued_total

Total cluster states sent by primary relay server(Leader->Follower)[5m]

Total number of Cluster state updates pushed by the H/A Primary (Relay Server) to H/A Secondary(s) (Relay Client(s))

Metric Name	Labels	PromQL Expression
`xcp_central_ha_primary_relay_server_cluster_states_sent_total`	N/A	sum by(instance) (increase(xcp_central_ha_primary_relay_server_cluster_states_sent_total[5m]))

Cluster States received by primary relay server(Follower->Leader)[5m]

Total number of Cluster state updates received by the H/A Primary (Relay Server) from H/A Secondary(s) (Relay Client(s))

Metric Name	Labels	PromQL Expression
`xcp_central_ha_primary_relay_server_cluster_states_received_total`	N/A	sum by(instance) (increase(xcp_central_ha_primary_relay_server_cluster_states_received_total[5m]))

Number of cluster state is sent by primary to Secondary for Different CPs[5m]

Total number of times a Cluster state has been pushed by the H/A Primary (Relay Server) to H/A Secondary(s) (Relay Client(s))

Metric Name	Labels	PromQL Expression
`xcp_central_ha_primary_relay_server_cluster_state_sent_total`	N/A	sum by(cluster) (increase(xcp_central_ha_primary_relay_server_cluster_state_sent_total[5m]))

Number of cluster states received by primary from secondaries for different CP[5m]

Total number of times a Cluster state has been received by the H/A Primary (Relay Server) from H/A Secondary(s) (Relay Client(s))

Metric Name	Labels	PromQL Expression
`xcp_central_ha_primary_relay_server_cluster_state_received_total`	N/A	sum by(cluster) (increase(xcp_central_ha_primary_relay_server_cluster_state_received_total[5m]))

Secondary relay client running

Flag indicating whether H/A Secondary (Relay Client) is running

Metric Name	Labels	PromQL Expression
`xcp_central_ha_secondary_relay_client_up`	N/A	avg by(instance) (xcp_central_ha_secondary_relay_client_up)

Total number of Second relay client stream

Total number of relay streams opened by the H/A Secondary (Relay Client)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_secondary_relay_client_streams_total`	N/A	avg by(instance) (xcp_central_ha_secondary_relay_client_streams_total)

Number of open relay stream by secondary relay client

Number of currently open relay streams by the H/A Secondary (Relay Client)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_secondary_relay_client_streams_open_count_total`	N/A	sum by(instance) (xcp_central_ha_secondary_relay_client_streams_open_count_total)

Number of cluster state updates recvd by secondary relay client from Primary[5m]

Total number of Cluster state updates received by the H/A Secondary (Relay Client) from the H/A Primary (Relay Server)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_secondary_relay_client_cluster_states_received_total`	N/A	sum by(instance) (increase(xcp_central_ha_secondary_relay_client_cluster_states_received_total[5m]))

Number of cluster states sent by secondary relay client to primary[5m]

Total number of Cluster state updates pushed by the H/A Secondary (Relay Client) to the H/A Primary (Relay Server)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_secondary_relay_client_cluster_states_sent_total`	N/A	sum by(instance) (increase(xcp_central_ha_secondary_relay_client_cluster_states_sent_total[5m]))

Number of cluster state updates recvd by secondary client from primary for different CPs[5m]

Total number of times a Cluster state has been received by the H/A Secondary (Relay Client) from the H/A Primary (Relay Server)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_secondary_relay_client_cluster_state_received_total`	N/A	sum by(cluster) (increase(xcp_central_ha_secondary_relay_client_cluster_state_received_total[5m]))

Number of cluster state updates sent by secondary relay to primary relay for different CPs[5m]

Total number of times a Cluster state has been pushed by the H/A Secondary (Relay Client) to the H/A Primary (Relay Server)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_secondary_relay_client_cluster_state_sent_total`	N/A	sum by(cluster) (increase(xcp_central_ha_secondary_relay_client_cluster_state_sent_total[5m]))

XCP Central Cross Partition enabled

Flag indicating whether support for Cross-Partition H/A is enabled

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_enabled`	N/A	avg(xcp_central_ha_cross_partition_enabled)

Total number of relay streams by H/A Cross-Partition Requestor

Total number of relay streams opened by the H/A Cross-Partition Requestor (Relay Client)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_requestor_relay_client_streams_total`	N/A	avg by(instance) (xcp_central_ha_cross_partition_requestor_relay_client_streams_total)

Number of open streams by Cross-partition requestor

Number of currently open relay streams by the H/A Cross-Partition Requestor (Relay Client)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_requestor_relay_client_streams_open_count`	N/A	sum by(instance) (xcp_central_ha_cross_partition_requestor_relay_client_streams_open_count)

Number of cluster state updates recvd by H/A Cross-partition requestor[5m]

Total number of Cluster state updates received by the H/A Cross-Partition Requestor (Relay Client) from the H/A Cross-Partition Responder (Relay Server)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_requestor_relay_client_cluster_states_received_total`	N/A	sum by(instance) (increase(xcp_central_ha_cross_partition_requestor_relay_client_cluster_states_received_total[5m]))

Number of cluster state updates recvd by H/A Cross partition requestor for different CPs[5m]

Total number of times a Cluster state has been received by the H/A Cross-Partition Requestor (Relay Client) from the H/A Cross-Partition Responder (Relay Server)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_requestor_relay_client_cluster_state_received_total`	N/A	sum by(cluster) (increase(xcp_central_ha_cross_partition_requestor_relay_client_cluster_state_received_total[5m]))

Total number of relay streams by H/A Cross-Partition Responder

Total number of relay streams opened at the H/A Cross-Partition Responder (Relay Server)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_responder_relay_server_streams_total`	N/A	avg by(instance) (xcp_central_ha_cross_partition_responder_relay_server_streams_total)

Number of open streams by Cross-partition responder

Number of currently open relay streams at the H/A Cross-Partition Responder (Relay Server)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_requestor_relay_client_streams_open_count`	N/A	sum by(instance) (xcp_central_ha_cross_partition_requestor_relay_client_streams_open_count)

Number of cluster state updates sent by H/A Cross-partition responder[5m]

Total number of Cluster state updates sent by the H/A Cross-Partition Responder (Relay Server) to the H/A Cross-Partition Requestor (Relay Client)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_responder_relay_server_cluster_states_sent_total`	N/A	sum by(instance) (increase(xcp_central_ha_cross_partition_responder_relay_server_cluster_states_sent_total[5m]))

Number of cluster state updates sent by H/A Cross partition responder for different CPs[5m]

Total number of times a Cluster state has been sent by the H/A Cross-Partition Responder (Relay Server) to the H/A Cross-Partition Requestor (Relay Client)

Metric Name	Labels	PromQL Expression
`xcp_central_ha_cross_partition_responder_relay_server_cluster_state_sent_total`	N/A	sum by(cluster) (increase(xcp_central_ha_cross_partition_responder_relay_server_cluster_state_sent_total[5m]))

Active Server Streams to XCP Central — Resource Exchange

Number of currently open Resource Exchange streams that the XCP Central is holding, broken down by edge cluster. In steady state each value should be at least 1 per connected edge; a drop to 0 means that edge is currently disconnected on this stream type.

Metric Name	Labels	PromQL Expression
`xcp_central_resource_exchange_server_streams_open_count`	N/A	sum by (edge_cluster_name) (xcp_central_resource_exchange_server_streams_open_count)

Active Server Streams to XCP Central — Cluster State

Number of currently open Cluster State streams that the XCP Central is holding, broken down by edge cluster. In steady state each value should be at least 1 per connected edge; a drop to 0 means that edge is currently disconnected on this stream type.

Metric Name	Labels	PromQL Expression
`xcp_central_cluster_state_server_streams_open_count`	N/A	sum by (edge_cluster_name) (xcp_central_cluster_state_server_streams_open_count)

Active Server Streams to XCP Central — Config Status

Number of currently open Config Status streams that the XCP Central is holding, broken down by edge cluster. In steady state each value should be at least 1 per connected edge; a drop to 0 means that edge is currently disconnected on this stream type.

Metric Name	Labels	PromQL Expression
`xcp_central_config_status_server_streams_open_count`	N/A	sum by (edge_cluster_name) (xcp_central_config_status_server_streams_open_count)

Number of new server streams to XCP Central [per aggregation interval]

Number of times each long-lived gRPC stream to this XCP Central has been (re)opened over the panel's aggregation interval. A persistently high value indicates connection instability between edges and central for that stream type.

Requires Prometheus >= 3.4.0 with --enable-feature=promql-duration-expr (this panel multiplies rate(...) by 1m, which only parses on Prometheus builds with PromQL duration-expression support enabled).

Metric Name	Labels	PromQL Expression
`xcp_central_aggregate_config_status_server_streams_total`	N/A	sum(rate(xcp_central_aggregate_config_status_server_streams_total[1m]))*1m or vector(0)
`xcp_central_cluster_state_server_streams_total`	N/A	sum(rate(xcp_central_cluster_state_server_streams_total[1m]))*1m or vector(0)
`xcp_central_config_status_server_streams_total`	N/A	sum(rate(xcp_central_config_status_server_streams_total[1m]))*1m or vector(0)
`xcp_central_resource_exchange_server_streams_total`	N/A	sum(rate(xcp_central_resource_exchange_server_streams_total[1m]))*1m or vector(0)
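
As a rough illustration of what multiplying a per-second rate by 1m yields, the sketch below turns two counter samples into an approximate per-minute count. It is a simplification under stated assumptions: the function name and sample values are hypothetical, and it ignores the counter-reset handling and extrapolation that Prometheus applies inside rate().

```python
def per_minute_count(first_value: float, last_value: float, window_seconds: float) -> float:
    """Approximate number of counter increments per minute from two samples.

    Simplified sketch: no counter-reset detection and no extrapolation.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    per_second_rate = (last_value - first_value) / window_seconds
    # Multiplying by 60 plays the role of the "* 1m" factor in the panel query.
    return per_second_rate * 60.0


# Hypothetical samples: the stream counter moved from 100 to 106 in 60 seconds.
print(per_minute_count(100, 106, 60))
```

A value that stays well above zero across many intervals would correspond to the connection instability described for this panel.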

XCP Central Operator Feature Flag (Boolean)

Metric Name	Labels	PromQL Expression
`xcp_central_operator_feature_enabled`	N/A	max by (name) ( xcp_central_operator_feature_enabled )

XCP Central Operator Feature Flag (Numeric)

Metric Name	Labels	PromQL Expression
`xcp_central_operator_feature_value`	N/A	max by (name) ( xcp_central_operator_feature_value )

XCP Central Operator Feature Flag (Seconds)

Metric Name	Labels	PromQL Expression
`xcp_central_operator_feature_value_seconds`	N/A	max by (name) ( xcp_central_operator_feature_value_seconds )

XCP Central Operator Feature Flag (String)

Metric Name	Labels	PromQL Expression
`xcp_central_operator_feature_value_string`	N/A	xcp_central_operator_feature_value_string

XCP Edge Operator

Installed Gateways

Number of gateways installed and actively managed by the current Istio revision.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_installed_total`	`cluster_name`	sum by (gateway_type) (xcp_edge_operator_gateway_installed_total{cluster_name="$cluster"})

Gateways Ignored (Revision Mismatch)

Gateways ignored because they belong to a different Istio revision. A non-zero value indicates an upgrade is in progress.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_ignored_total`	`cluster_name` `reason`	sum by (gateway_type) (xcp_edge_operator_gateway_ignored_total{cluster_name="$cluster", reason="revision_mismatch"})

Paused Gateways

Gateways with reconciliation paused. A paused gateway will not be upgraded until reconciliation is resumed. Non-zero values may indicate a forgotten freeze after an upgrade.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_reconcile_paused`	`cluster_name`	sum(xcp_edge_operator_gateway_reconcile_paused{cluster_name="$cluster"}) or vector(0)

Gateways in Dirty State

Gateways in dirty state. A gateway is dirty when its desired state does not match the applied state. Dirty gateways block reconciliation for newer gateways in the same namespace.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_dirty_state`	`cluster_name`	sum(xcp_edge_operator_gateway_dirty_state{cluster_name="$cluster"}) or vector(0)

Gateway Reconcile Rate (Success vs Failure)

Rate of gateway reconciliations performed by the Edge Operator, grouped by gateway type and result. Failures during upgrades may indicate incompatible configurations or image pull issues.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_reconcile_success_total`	`cluster_name`	sum by (gateway_type) (increase(xcp_edge_operator_gateway_reconcile_success_total{cluster_name="$cluster"}[1m]))
`xcp_edge_operator_gateways_reconcile_failure_total`	`cluster_name`	sum by (gateway_type) (increase(xcp_edge_operator_gateways_reconcile_failure_total{cluster_name="$cluster"}[1m]))

Gateway Reconcile Time (p50/p95/p99)

Gateway reconcile latency percentiles. Elevated times during upgrades can indicate resource contention or slow image pulls.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_reconcile_time_ms_bucket`	`cluster_name`	histogram_quantile(0.50, sum by (le, gateway_type) (rate(xcp_edge_operator_gateway_reconcile_time_ms_bucket{cluster_name="$cluster"}[1m])))
`xcp_edge_operator_gateway_reconcile_time_ms_bucket`	`cluster_name`	histogram_quantile(0.95, sum by (le, gateway_type) (rate(xcp_edge_operator_gateway_reconcile_time_ms_bucket{cluster_name="$cluster"}[1m])))
`xcp_edge_operator_gateway_reconcile_time_ms_bucket`	`cluster_name`	histogram_quantile(0.99, sum by (le, gateway_type) (rate(xcp_edge_operator_gateway_reconcile_time_ms_bucket{cluster_name="$cluster"}[1m])))

Skipped Reconciliations by Reason

Rate of gateway reconciliations skipped, broken down by reason. During upgrades, watch for:

already_reconciled: gateway already at the latest observed generation, no work needed (expected to dominate at steady state)
dirty_skipped: skipped because a previous gateway in the namespace is in dirty state
revision_api_disabled: the revision has reconciliation disabled
namespace_api_disabled: a namespace override has disabled reconciliation
object_label_disabled: reconciliation disabled via a label on the gateway object

Note: gateways belonging to a different revision are counted as revision_mismatch in the 'Gateways Ignored' stat, not here.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_reconcile_skipped_total`	`cluster_name` `gateway_namespace`	sum by (reason, gateway_type) (increase(xcp_edge_operator_gateway_reconcile_skipped_total{cluster_name="$cluster", gateway_namespace=~"$namespace"}[1m]))

Force Reconcile Triggers

Rate of force reconcile operations triggered via the reconcile-before label. Operators typically use this to manually push a gateway upgrade forward.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_force_reconcile_total`	`cluster_name` `gateway_namespace`	sum by (gateway_type) (increase(xcp_edge_operator_gateway_force_reconcile_total{cluster_name="$cluster", gateway_namespace=~"$namespace"}[1m]))

Paused Gateways

List of gateways with reconciliation currently paused. Paused gateways will not be reconciled until the pause is lifted. Alert XCP-15 fires when a gateway has been paused for more than 7 days.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_reconcile_paused`	`cluster_name` `gateway_namespace`	xcp_edge_operator_gateway_reconcile_paused{cluster_name="$cluster", gateway_namespace=~"$namespace"} == 1

Gateways in Dirty State

List of gateways currently in dirty state. A dirty gateway has a desired state that does not match its applied state, blocking reconciliation for subsequent gateways in the same namespace.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_gateway_dirty_state`	`cluster_name` `gateway_namespace`	xcp_edge_operator_gateway_dirty_state{cluster_name="$cluster", gateway_namespace=~"$namespace"} == 1

Time Since Last Gateway Reconcile

Time elapsed since each gateway was last reconciled. Stale values (large times) indicate gateways that have not been reconciled recently, which may mean they are stuck mid-upgrade.

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_last_gateway_reconcile_timestamp_ms`	`cluster_name` `gateway_namespace`	time() - (xcp_edge_operator_last_gateway_reconcile_timestamp_ms{cluster_name="$cluster", gateway_namespace=~"$namespace"} / 1000)

XCP Edge status

Metric Name	Labels	PromQL Expression
`process_start_time_seconds`	`cluster_name` `component`	time() - process_start_time_seconds{cluster_name="$cluster",component="xcp"}

XCP Edge Version

Metric Name	Labels	PromQL Expression
`xcp_edge_istio_versions`	`cluster_name`	label_replace(xcp_edge_istio_versions{cluster_name="$cluster"}, "istio_versions", "$1", "version", "(.*)")
`xcp_edge_version`	`cluster_name`	label_replace(xcp_edge_version{cluster_name="$cluster"}, "xcp_version", "$1", "version", "(.*)")

Number of gatewayHost exposed

Metric Name	Labels	PromQL Expression
`xcp_edge_gateway_hosts_count`	`cluster_name`	xcp_edge_gateway_hosts_count{cluster_name="$cluster"}

Time since any message sent to central on config stream (seconds)

Time since any of the following messages was sent by edge to central:

Periodic (per minute) config resync request
Ack of last config received
Cluster state

Because resync requests are sent periodically, a value higher than the resync period (60 seconds by default) is not normal.

Metric Name	Labels	PromQL Expression
`xcp_edge_last_config_resync_to_central_timestamp_ms`	`cluster_name`	time() - xcp_edge_last_config_resync_to_central_timestamp_ms{cluster_name="$cluster"} / 1000
`xcp_edge_last_push_to_central_timestamp_ms`	`cluster_name`	time() - xcp_edge_last_push_to_central_timestamp_ms{cluster_name="$cluster"} / 1000
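
The expressions above convert a millisecond timestamp to seconds and subtract it from the current time. A minimal Python sketch of the same arithmetic, with a staleness check against the 60-second default resync period, might look like this. The function names and sample timestamps are illustrative assumptions, not part of TSB.

```python
RESYNC_PERIOD_SECONDS = 60.0  # default resync period mentioned in the panel description


def seconds_since(timestamp_ms: float, now_seconds: float) -> float:
    """Mirror of `time() - <metric_ms> / 1000`: elapsed seconds since the timestamp."""
    return now_seconds - timestamp_ms / 1000.0


def is_stale(timestamp_ms: float, now_seconds: float,
             resync_period: float = RESYNC_PERIOD_SECONDS) -> bool:
    """True when more time has passed than one resync period."""
    return seconds_since(timestamp_ms, now_seconds) > resync_period


# Hypothetical values: last message sent 90 seconds before "now".
last_sent_ms = 1_700_000_000_000
now = 1_700_000_090.0
print(seconds_since(last_sent_ms, now), is_stale(last_sent_ms, now))
```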

cluster-state build time percentiles(in secs)

Time (in secs) taken to build the local cluster state. This build time is a subset of the cluster-update-propagation time.

Metric Name	Labels	PromQL Expression
`xcp_edge_cluster_state_build_time_secs_bucket`	`cluster_name`	histogram_quantile(0.5,sum(xcp_edge_cluster_state_build_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_cluster_state_build_time_secs_bucket`	`cluster_name`	histogram_quantile(0.9,sum(xcp_edge_cluster_state_build_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_cluster_state_build_time_secs_bucket`	`cluster_name`	histogram_quantile(0.95,sum(xcp_edge_cluster_state_build_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_cluster_state_build_time_secs_bucket`	`cluster_name`	histogram_quantile(0.99,sum(xcp_edge_cluster_state_build_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_cluster_state_build_time_secs_bucket`	`cluster_name`	histogram_quantile(1,sum(xcp_edge_cluster_state_build_time_secs_bucket{cluster_name="$cluster"}) by (le))
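
The percentile rows above rely on histogram_quantile, which estimates a quantile from cumulative bucket counts. The sketch below follows the linear interpolation that Prometheus documents for classic histograms. The bucket boundaries and counts are made-up sample values, and edge cases such as non-positive bucket bounds are not handled.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and must
    include the +Inf bucket, given here as math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, previous = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Hypothetical cumulative counts for bounds of 1.5, 2.5 and 4 seconds.
sample = [(1.5, 50), (2.5, 80), (4.0, 95), (math.inf, 100)]
print(histogram_quantile(0.9, sample))
```

Because the result is interpolated inside a bucket, its precision is limited by how the bucket boundaries are laid out.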

Number of times cluster states sent to central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_count`	`cluster_name`	sum(xcp_edge_local_cluster_update_propagation_time_secs_count{cluster_name="$cluster"}) by (trigger_reason)

cluster-state propagation delay percentiles(in secs)

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name`	histogram_quantile(0.5,sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name`	histogram_quantile(0.9,sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name`	histogram_quantile(0.95,sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name`	histogram_quantile(0.99,sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name`	histogram_quantile(1,sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster"}) by (le))

propagated to central in 0-1.5 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="1.5"}) by (trigger_reason)
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="4"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="2"}

Number of times config status sent by edge to central in last 5 min

Number of times config statuses are sent by edge to central, with respective objects' Kind.

Messages received by central from any edge are of three types:

Periodic(per minute by default) config resync request
cluster state
Header message to ack the config received

This number is the combined count of all three in the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_edge_config_status_updates_sent_gvk_total`	`cluster_name`	increase(xcp_edge_config_status_updates_sent_gvk_total{cluster_name="$cluster"}[5m])

propagated to central in 1.5-2.5 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="2.5"} - ignoring(le,cluster_name) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="1.5"}) by (trigger_reason)
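
Because histogram buckets are cumulative, the number of observations that fall between two boundaries is the difference between the two bucket counters, which is what the "propagated to central in X-Y secs" panels compute. The sketch below shows that subtraction with made-up counts. The helper names are hypothetical, and the boundaries are taken from the panel titles.

```python
def count_in_range(cumulative_buckets, lower_le, upper_le):
    """Observations with lower_le < value <= upper_le, from cumulative counts."""
    return cumulative_buckets[upper_le] - cumulative_buckets[lower_le]


def count_above(cumulative_buckets, total_count, le):
    """Observations above the given boundary: total count minus that bucket."""
    return total_count - cumulative_buckets[le]


# Hypothetical cumulative counts keyed by the `le` label value.
buckets = {"1.5": 50, "2.5": 80, "4": 95, "40": 99}
total = 100  # value of the matching *_count series

print(count_in_range(buckets, "1.5", "2.5"))  # updates propagated in 1.5-2.5 secs
print(count_above(buckets, total, "40"))      # updates that took more than 40 secs
```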

propagated to central in 2.5-4 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="4"} - ignoring(le,cluster_name) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="2.5"}) by (trigger_reason)

Length of cluster state event queue

Current length of the cluster state events queue. This metric tracks the cluster state events that are queued and ready to be dequeued and sent to central. A high value means that events are being enqueued but dequeuing is blocked by a bottleneck when sending to central.

Metric Name	Labels	PromQL Expression
`xcp_edge_current_state_state_events_queue_len`	`cluster_name`	xcp_edge_current_state_state_events_queue_len{cluster_name="$cluster"}

propagated to central in 4-7 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="7"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="4"}) by (trigger_reason)

propagated to central in 11-15 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="15"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="11"}) by (trigger_reason)

propagated to central in 7-11 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="11"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="7"}) by (trigger_reason)

propagated to central in 15-20 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="20"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="15"}) by (trigger_reason)

propagated to central in 20-30 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="30"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="20"}) by (trigger_reason)

propagated to central in 30-40 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="40"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="30"}) by (trigger_reason)

propagated to central in more than 40 secs

Time (in secs) taken to propagate a change in the local cluster state to remote central

Metric Name	Labels	PromQL Expression
`xcp_edge_local_cluster_update_propagation_time_secs_bucket`	`cluster_name` `le`	sum(xcp_edge_local_cluster_update_propagation_time_secs_count{cluster_name="$cluster"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="40"}) by (trigger_reason)
`xcp_edge_local_cluster_update_propagation_time_secs_count`	`cluster_name`	sum(xcp_edge_local_cluster_update_propagation_time_secs_count{cluster_name="$cluster"} - ignoring(le) xcp_edge_local_cluster_update_propagation_time_secs_bucket{cluster_name="$cluster", le="40"}) by (trigger_reason)

Number of times cluster states received by edge from central in last 1 min

Number of times cluster states are received by edge from central in the last 1 min.

Metric Name	Labels	PromQL Expression
`xcp_edge_cluster_state_received_from_central_count_total`	`cluster_name`	increase(xcp_edge_cluster_state_received_from_central_count_total{cluster_name="$cluster"}[1m])

config translation duration percentiles(in ms)

Total time taken in completing Istio translation for all the app namespaces

Metric Name	Labels	PromQL Expression
`xcp_edge_total_translation_time_in_ms_bucket`	`cluster_name`	histogram_quantile(0.5,sum(xcp_edge_total_translation_time_in_ms_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_total_translation_time_in_ms_bucket`	`cluster_name`	histogram_quantile(0.9,sum(xcp_edge_total_translation_time_in_ms_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_total_translation_time_in_ms_bucket`	`cluster_name`	histogram_quantile(0.95,sum(xcp_edge_total_translation_time_in_ms_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_total_translation_time_in_ms_bucket`	`cluster_name`	histogram_quantile(0.99,sum(xcp_edge_total_translation_time_in_ms_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_total_translation_time_in_ms_bucket`	`cluster_name`	histogram_quantile(1,sum(xcp_edge_total_translation_time_in_ms_bucket{cluster_name="$cluster"}) by (le))

Number of times config CRs received by edge from central in last 5 min

Number of times config CRs are received by edge from central in the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_edge_config_updates_received_count_total`	`cluster_name`	increase(xcp_edge_config_updates_received_count_total{cluster_name="$cluster"}[5m])

Translation count per min

Number of Istio config translations in Edge per namespace per min

Metric Name	Labels	PromQL Expression
`xcp_edge_istio_translations_count_total`	`cluster_name`	increase(xcp_edge_istio_translations_count_total{cluster_name="$cluster"}[1m])

Number of configs created/updated by edge at k8s apiserver every 5 minutes

Shows the activity of Edge creating or updating objects in the K8s API, grouped by object kind.

Metric Name	Labels	PromQL Expression
`xcp_edge_cr_added_total`	`cluster_name`	increase(xcp_edge_cr_added_total{cluster_name="$cluster"}[5m]) OR increase(xcp_edge_cr_updated_total{cluster_name="$cluster"}[5m])
`xcp_edge_cr_updated_total`	`cluster_name`	increase(xcp_edge_cr_added_total{cluster_name="$cluster"}[5m]) OR increase(xcp_edge_cr_updated_total{cluster_name="$cluster"}[5m])

Number of configs deleted by edge from k8s apiserver every 5 minutes

Shows the activity of Edge deleting objects in K8s API, grouped by object kind.

Metric Name	Labels	PromQL Expression
`xcp_edge_cr_deleted_total`	`cluster_name`	increase(xcp_edge_cr_deleted_total{cluster_name="$cluster"}[5m])

k8s config apply duration P50 percentiles for each namespace (in secs)

Metric Name	Labels	PromQL Expression
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.5,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le,namespace))
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.9,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.95,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.99,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(1,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le))

k8s config apply duration P90 percentiles for each namespace (in secs)

Metric Name	Labels	PromQL Expression
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.9,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le,namespace))

k8s config apply duration P99 percentiles for each namespace (in secs)

Metric Name	Labels	PromQL Expression
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.99,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le,namespace))
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.9,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.95,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(0.99,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le))
`xcp_edge_k8s_configs_apply_time_secs_bucket`	`cluster_name`	histogram_quantile(1,sum(xcp_edge_k8s_configs_apply_time_secs_bucket{cluster_name="$cluster"}) by (le))

All goroutines

Metric Name	Labels	PromQL Expression
`go_goroutines`	`cluster_name` `component`	go_goroutines{cluster_name="$cluster", component="xcp"}

Edge specific goroutines

This shows the number of active goroutines in XCP Edge that are responsible for config translation.

Metric Name	Labels	PromQL Expression
`xcp_edge_go_routine_count_total`	`cluster_name`	increase(xcp_edge_go_routine_count_total{cluster_name="$cluster"}[1m])

Edge CPU consumption

Metric Name	Labels	PromQL Expression
`process_cpu_seconds_total`	`cluster_name` `job`	rate(process_cpu_seconds_total{job="edge-xcp",cluster_name="$cluster"}[1m])

Edge memory consumption

Metric Name	Labels	PromQL Expression
`go_memstats_heap_inuse_bytes`	`cluster_name` `component`	go_memstats_heap_inuse_bytes{component="xcp",cluster_name="$cluster"}
`go_memstats_stack_inuse_bytes`	`cluster_name` `component`	go_memstats_stack_inuse_bytes{component="xcp",cluster_name="$cluster"}

Custom Resource events[5m]

This panel represents the increase in custom resource events received by the edge registry controller in the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_edge_registry_kubernetes_custom_resource_events_total`	`cluster_name`	sum by(kind) (increase(xcp_edge_registry_kubernetes_custom_resource_events_total{cluster_name="$cluster"}[5m]))

EDS Update events[5m]

This panel represents the per-second rate of EDS update events received by the Edge registry controller, averaged over the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_edge_registry_kubernetes_eds_update_events_received_total`	`cluster_name`	sum(rate(xcp_edge_registry_kubernetes_eds_update_events_received_total{cluster_name="$cluster"}[5m]))

Namespace Events[5m]

This panel represents the increase in namespace events received by the Edge in the last 5-minute interval. Edge responds to the namespace events through its namespace controller.

Metric Name	Labels	PromQL Expression
`xcp_edge_registry_kubernetes_namespace_events_received_total`	`cluster_name`	sum by(event_type) (increase(xcp_edge_registry_kubernetes_namespace_events_received_total{cluster_name="$cluster"}[5m]))

Node Events received[5m]

This panel represents the increase in node events received by the Edge Kubernetes registry controller in the last 5 minutes.

There are two different sources for node events. Edge responds differently for different node event sources:

Node Controller: Used for Gateway Hold webhook
XDS Updater Config Update: Used to update node port service addresses

Metric Name	Labels	PromQL Expression
`xcp_edge_registry_kubernetes_node_events_received_total`	`cluster_name`	sum by(node_event_source) (increase(xcp_edge_registry_kubernetes_node_events_received_total{cluster_name="$cluster"}[5m]))

SvcUpdate events[5m]

This panel represents the increase in service update events received by the Edge Kubernetes registry controller in the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_edge_registry_kubernetes_svc_update_events_received_total`	`cluster_name`	sum(increase(xcp_edge_registry_kubernetes_svc_update_events_received_total{cluster_name="$cluster"}[5m]))

Service Entry events received[5m]

This panel represents the increase in Service Entry events received by the Edge Kubernetes registry controller in the last 5 minutes.

Metric Name	Labels	PromQL Expression
`xcp_edge_registry_kubernetes_service_entry_events_received_total`	`cluster_name`	sum by(event_type) (increase(xcp_edge_registry_kubernetes_service_entry_events_received_total{cluster_name="$cluster"}[5m]))

Config Translation Attempts and Failures [per aggregation interval]

Total attempts to translate XCP configuration resources into Istio resources, and the subset that failed, over the panel's aggregation interval.

Metric Name	Labels	PromQL Expression
`xcp_edge_config_translation_attempts_failed_total`	`cluster_name`	-sum(rate(xcp_edge_config_translation_attempts_failed_total{cluster_name="$cluster"}[1m]))*1m or vector(0)
`xcp_edge_config_translation_attempts_total`	`cluster_name`	sum(rate(xcp_edge_config_translation_attempts_total{cluster_name="$cluster"}[1m]))*1m or vector(0)
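
One way to read this panel is as a failure ratio: failed translation attempts divided by total attempts over the same interval. The following is a minimal Python sketch of that arithmetic, assuming the two per-interval values have already been obtained. The function name `translation_failure_ratio` is hypothetical, and it expects the failure count as a non-negative number, whereas the panel expression negates the failure series.

```python
def translation_failure_ratio(attempts, failures):
    """Return failures / attempts, or 0.0 when there were no attempts."""
    if attempts < 0 or failures < 0:
        raise ValueError("counts must be non-negative")
    if attempts == 0:
        # No translation was attempted in the interval, so nothing failed.
        return 0.0
    return failures / attempts


# Hypothetical values for one aggregation interval.
ratio = translation_failure_ratio(attempts=40, failures=2)
print(f"{ratio:.1%} of translation attempts failed")
```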

Resources present in the cluster but unmanaged by XCP (by API Group and Kind)

Current number of resources present in the cluster but not managed by XCP, broken down by API group and kind. A non-zero value means another producer (e.g. a human, another controller, or remnants from a previous install) is creating Istio config alongside XCP, which can cause drift or surprise overrides.

Metric Name	Labels	PromQL Expression
`xcp_edge_unmanaged_resources_count`	`cluster_name`	sum(xcp_edge_unmanaged_resources_count{cluster_name="${cluster}"}) by (group, kind)

Active Connections from XCP Edge to XCP Central

Current number of active connections from XCP Edge to XCP Central, grouped by connection type.

Metric Name	Labels	PromQL Expression
`xcp_edge_cluster_state_client_streams_open_count`	`cluster_name`	sum(xcp_edge_cluster_state_client_streams_open_count{cluster_name="${cluster}"}) or vector(0)
`xcp_edge_config_status_client_streams_open_count`	`cluster_name`	sum(xcp_edge_config_status_client_streams_open_count{cluster_name="${cluster}"}) or vector(0)
`xcp_edge_diagnostic_channel_client_open_channel_websocket_connections_active_count`	`cluster_name`	sum(xcp_edge_diagnostic_channel_client_open_channel_websocket_connections_active_count{cluster_name="${cluster}"}) or vector(0)
`xcp_edge_resource_exchange_client_streams_open_count`	`cluster_name`	sum(xcp_edge_resource_exchange_client_streams_open_count{cluster_name="${cluster}"}) or vector(0)

Number of reconnects from XCP Edge to XCP Central [per aggregation interval]

Number of times each long-lived gRPC stream from XCP Edge to XCP Central has been (re)opened over the panel's aggregation interval. A persistently high value indicates instability in the Edge↔Central link for that stream type.

Metric Name	Labels	PromQL Expression
`xcp_edge_cluster_state_client_streams_total`	`cluster_name`	sum(rate(xcp_edge_cluster_state_client_streams_total{cluster_name="${cluster}"}[1m]))*1m or vector(0)
`xcp_edge_config_status_client_streams_total`	`cluster_name`	sum(rate(xcp_edge_config_status_client_streams_total{cluster_name="${cluster}"}[1m]))*1m or vector(0)
`xcp_edge_diagnostic_channel_client_open_channel_websocket_connections_total`	`cluster_name`	sum(rate(xcp_edge_diagnostic_channel_client_open_channel_websocket_connections_total{cluster_name="${cluster}"}[1m]))*1m or vector(0)
`xcp_edge_resource_exchange_client_streams_total`	`cluster_name`	sum(rate(xcp_edge_resource_exchange_client_streams_total{cluster_name="${cluster}"}[1m]))*1m or vector(0)
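
The description above calls out a persistently high value as the signal to watch, rather than a single spike. The Python sketch below shows one way to make that distinction, assuming the per-interval reconnect counts for each stream type are already available as lists. The function name `persistently_high`, the threshold of 2, the minimum of 3 consecutive intervals, and the sample data are all hypothetical and should be tuned to your environment.

```python
def persistently_high(counts, threshold, min_intervals):
    """Return True if at least `min_intervals` consecutive values exceed `threshold`."""
    run = 0
    for value in counts:
        # Extend the current run on a high value, otherwise start over.
        run = run + 1 if value > threshold else 0
        if run >= min_intervals:
            return True
    return False


# Hypothetical reconnect counts per aggregation interval, by stream type.
reconnects = {
    "cluster_state": [0, 1, 0, 0, 0, 0],
    "resource_exchange": [3, 4, 5, 4, 0, 0],
}

unstable = sorted(
    name
    for name, counts in reconnects.items()
    if persistently_high(counts, threshold=2, min_intervals=3)
)
print(unstable)
```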

XCP Edge Operator Feature Flag (Boolean)

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_feature_enabled`	`cluster_name`	max by (name, cluster_name) ( xcp_edge_operator_feature_enabled{cluster_name="$cluster"} )

XCP Edge Operator Feature Flag (Numeric)

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_feature_value`	`cluster_name`	max by (name, cluster_name) ( xcp_edge_operator_feature_value{cluster_name="$cluster"} )

XCP Edge Operator Feature Flag (Seconds)

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_feature_value_seconds`	`cluster_name`	max by (name, cluster_name) ( xcp_edge_operator_feature_value_seconds{cluster_name="$cluster"} )

XCP Edge Operator Feature Flag (String)

Metric Name	Labels	PromQL Expression
`xcp_edge_operator_feature_value_string`	`cluster_name`	xcp_edge_operator_feature_value_string{cluster_name="$cluster"}
