Managing Gateways during Upgrades
When you upgrade your TSB ControlPlane, the new operator may regenerate gateway manifests with updated values — a new proxy image, new environment variables, a new injection template. Applying those changes restarts the gateway pods, and during a large upgrade many gateways restarting at once can mean unplanned downtime.
Starting in TSB 1.14.x, Platform Operators have three controls to manage this:
| Control | What it does |
|---|---|
| gatewayReconciliation in the ControlPlane CR | Pauses reconciliation for an entire revision, or for specific namespaces. |
| xcp.tetrate.io/gateway-reconcile label on a gateway install CR | Pauses (or explicitly enables) reconciliation for a single gateway. Highest precedence. |
| /debug/gateway-reconcile-diff endpoint on xcp-operator-edge | Dry-run preview of exactly what would change for each gateway if it were reconciled. |
Together they let you pause gateways before an upgrade, preview what the upgrade would change, and release each batch of gateways into a maintenance window of your choosing.
A gateway is reconciled when the operator (re)applies its generated manifests. A reconcile only restarts the gateway's pods when the new manifests change the pod template — for example, a different proxy image or new environment variables.
Recommended Upgrade Workflow
Follow these steps for every cluster where you want to control when gateways restart.
Pause gateway reconciliation before the upgrade
You have two ways to pause reconciliation. Use whichever fits the scope you need:
- The gatewayReconciliation setting in the ControlPlane CR is the bulk control. Pause an entire revision in one edit, and selectively allow specific namespaces through (for example, a staging namespace where you want to validate the upgrade first).
- The xcp.tetrate.io/gateway-reconcile label on a single gateway is the surgical override. It always wins over the API setting, which is handy for protecting one business-critical gateway, or for un-pausing one gateway out of a namespace that is otherwise paused.
The example below pauses every gateway under the default revision except those in envoy-staging and envoy-test:

    kubectl edit -n istio-system controlplane controlplane

    apiVersion: install.tetrate.io/v1alpha1
    kind: ControlPlane
    metadata:
      name: controlplane
      namespace: istio-system
    spec:
      xcp:
        isolationBoundaries:
          - name: global
            revisions:
              - name: default
                istio:
                  tsbVersion: 1.14.1
                # Pause all gateways in this revision during the upgrade
                gatewayReconciliation:
                  enabled: false
                  namespaceOverrides:
                    # Allow staging/test namespaces to reconcile for canary verification
                    - namespace: envoy-staging
                      enabled: true
                    - namespace: envoy-test
                      enabled: true

To pause an individual gateway, label its install CR:

    # Pause one gateway by name
    kubectl label gateways.install.tetrate.io -n bookinfo bookinfo-gw \
      xcp.tetrate.io/gateway-reconcile=false --overwrite

Confirm the gateways are paused:

    # List gateway phases
    kubectl get gatewaydeployments -A \
      -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase'
    # Expected output: gateways covered by your pause settings show RECONCILIATION_PAUSED.
    # NAMESPACE       NAME          PHASE
    # bookinfo        bookinfo-gw   RECONCILIATION_PAUSED
    # istio-system    tier1-gw      RECONCILIATION_PAUSED
    # envoy-staging   staging-gw    READY

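
On a cluster with many gateways, a per-phase tally can be easier to scan than the full listing. The following is a minimal sketch that assumes the three-column NAMESPACE/NAME/PHASE output of the command above; count_phases is a hypothetical helper name, not something shipped with TSB or kubectl.

```shell
# count_phases: hypothetical helper (not part of TSB or kubectl).
# Reads the NAMESPACE/NAME/PHASE listing on stdin and prints one
# "<phase> <count>" line per phase, sorted by phase name.
count_phases() {
  awk '
    NR == 1 && $1 == "NAMESPACE" { next }   # skip the header row
    NF >= 3 { counts[$3]++ }                # third column is the phase
    END { for (p in counts) print p, counts[p] }
  ' | sort
}
```

To use it, pipe the listing command into the function, for example `kubectl get gatewaydeployments -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase' | count_phases`.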
Upgrade the ControlPlane
Follow your normal TSB upgrade procedure. Gateways that are paused will continue running on their current configuration and will not be modified.
Preview what the upgrade would change
Before releasing any gateway, query the dry-run diff endpoint to see exactly what reconciliation would do. The endpoint is read-only — it never modifies cluster state.
Port-forward to the edge operator:

    # Port-forward to xcp-operator-edge
    kubectl port-forward -n istio-system deployment/xcp-operator-edge 8090:8090

Then query the diff endpoint:

    http://localhost:8090/debug/gateway-reconcile-diff

The response is a JSON document with one entry per gateway in the cluster. Each entry reports whether the gateway has pending changes (hasChanges) and whether applying them would restart the pods (willCauseRestart):

    {
      "gateways": [
        {
          "name": "bookinfo-gw",
          "namespace": "bookinfo",
          "kind": "GatewayDeployment",
          "revision": "default",
          "reconcileEnabled": false,
          "reconcileDisabledReason": "RevisionApiDisabled",
          "diff": {
            "deployment": {
              "hasChanges": true,
              "summary": "pod template changed (will cause restart): ...",
              "willCauseRestart": true
            },
            "service": { "hasChanges": false },
            "serviceAccount": { "hasChanges": false },
            "hpa": { "hasChanges": false }
          }
        }
      ],
      "summary": {
        "total": 47,
        "withChanges": 3,
        "willCauseRestart": 2,
        "paused": 45,
        "enabled": 2
      }
    }

Use jq to quickly list the gateways that would restart:

    # List gateways whose pods would restart on reconcile
    curl -s "http://localhost:8090/debug/gateway-reconcile-diff" \
      | jq -r '.gateways[]
          | select([.diff // {} | .[]? | select(. != null) | .willCauseRestart == true] | any)
          | "\(.namespace)/\(.name)"'
    # bookinfo/bookinfo-gw
    # istio-system/tier1-gw

Or get a per-resource summary of the changes:

    # Per-resource change summary
    curl -s "http://localhost:8090/debug/gateway-reconcile-diff" \
      | jq -r '.gateways[] | . as $g | (.diff // {}) | to_entries[]
          | select(.value != null and .value.hasChanges == true)
          | "\($g.namespace)/\($g.name)\t\(.key)\t\(.value.summary // "")"'

Slow on large clusters? On clusters with many gateways the cluster-wide diff can take time. Increase the worker concurrency:

    http://localhost:8090/debug/gateway-reconcile-diff?concurrency=5

The endpoint is for debugging only. For very large responses, consider raising the memory limit on the xcp-operator-edge pod before requesting a cluster-wide diff.

Release gateways during your maintenance window
When you are ready to allow gateways to update, release them in batches so you can validate each batch before moving on.
Release a namespace by adding (or updating) an override in the ControlPlane CR:

    gatewayReconciliation:
      enabled: false
      namespaceOverrides:
        - namespace: envoy-staging
          enabled: true
        - namespace: envoy-prod-us-east # newly released batch
          enabled: true

Release a single gateway that you paused via the label by removing it:

    kubectl label gateways.install.tetrate.io -n bookinfo bookinfo-gw \
      xcp.tetrate.io/gateway-reconcile-

Release everything by re-enabling reconciliation globally:

    gatewayReconciliation:
      enabled: true
      # namespaceOverrides no longer needed; can be removed

Watch the gateway phases transition back to READY as each batch is released:

    kubectl get gatewaydeployments -A \
      -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase'

If any released gateway reports RECONCILIATION_DIRTY instead of READY, see Troubleshooting: a gateway is stuck in RECONCILIATION_DIRTY.

Confirm gateways are healthy and on the new proxy version
After release, check the rollout completed and all pods are running. In the kubectl get deployments output, the READY column should show n/n (every replica ready); in kubectl get pods, every pod should be Running and 1/1.

    kubectl get deployments -n bookinfo
    kubectl get pods -n bookinfo

Then confirm the istio-proxy container is running the image that ships with the new ControlPlane. Each gateway's deployment carries an app: tsb-gateway-... label whose exact value depends on the gateway and namespace. Discover it with kubectl get deploy -n <namespace> --show-labels, then use it to select the pods:

    # Show the running proxy image for a gateway
    kubectl get pods -n bookinfo -l app=tsb-gateway-bookinfo \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="istio-proxy")].image}{"\n"}{end}'
    # bookinfo-gw-7c8b9d5f4-abc12 docker.io/istio/proxyv2:1.24.4
    # bookinfo-gw-7c8b9d5f4-def34 docker.io/istio/proxyv2:1.24.4

Gateways with no pod-template change won't restart on their own. If a TSB upgrade does not change a gateway's pod template (for example, the proxy image tag is unchanged), the gateway will be marked reconciled but its pods will keep running the previous image until you restart them yourself.
To pick up a new image on those gateways, do a rolling restart of the underlying Kubernetes Deployment during your maintenance window:

    # Force-restart a gateway's pods
    kubectl rollout restart deployment <gateway-deployment-name> -n <namespace>

Troubleshooting: a gateway is stuck in RECONCILIATION_DIRTY
A released gateway normally moves from RECONCILIATION_PAUSED back to READY once the operator reconciles it. RECONCILIATION_DIRTY means the operator detected pending changes for the gateway but the regular reconcile loop has not applied them — the gateway keeps running on its existing configuration in the meantime.
Inspect the DirtyStateDetected condition to see which resources differ:

    kubectl describe gatewaydeployment <name> -n <namespace>

To apply the pending changes, force-reconcile the gateway by setting the install.tetrate.io/reconcile-before label to a near-future timestamp. The value is a compact ISO 8601 UTC timestamp in the format YYYYMMDDThhmmssZ — for example, 20260601T120000Z means 1 June 2026, 12:00 UTC.

    kubectl label gateways.install.tetrate.io <name> -n <namespace> \
      install.tetrate.io/reconcile-before=20260601T120000Z --overwrite

While the current time is before the timestamp, the operator performs a full reconcile and restarts the pods if the pod template changed. Once the timestamp passes, the label has no further effect — it is safe to leave in place, including in GitOps-managed manifests, and does not affect future upgrades.
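
Typing the timestamp by hand is error-prone, so it can help to generate it. The following is a minimal sketch of one way to produce a value in the YYYYMMDDThhmmssZ format a few minutes in the future. The helper name reconcile_before_stamp is hypothetical, and the sketch assumes GNU date (BSD and macOS date use -v instead of -d).

```shell
# reconcile_before_stamp: hypothetical helper (not part of TSB).
# Prints a UTC timestamp N minutes in the future (default 10) in the
# compact YYYYMMDDThhmmssZ format used by install.tetrate.io/reconcile-before.
# Assumes GNU date; the -d option is not available in BSD/macOS date.
reconcile_before_stamp() {
  local minutes="${1:-10}"
  date -u -d "+${minutes} minutes" +%Y%m%dT%H%M%SZ
}

# Example (commented out so that loading this file changes nothing):
# kubectl label gateways.install.tetrate.io <name> -n <namespace> \
#   install.tetrate.io/reconcile-before="$(reconcile_before_stamp 15)" --overwrite
```

A short window such as 10 to 15 minutes leaves time for the operator to pick up the label while keeping the forced reconcile close to the moment you applied it.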
Confirm the gateway is back to READY:

    kubectl get gatewaydeployment <name> -n <namespace> -o jsonpath='{.status.phase}'

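
Because the phase may take a few reconcile cycles to settle, you may prefer to poll rather than rerun the command by hand. The following is a minimal sketch that wraps the same kubectl query in a retry loop; wait_for_ready, its attempt count, and its interval are hypothetical choices, not TSB defaults.

```shell
# wait_for_ready: hypothetical helper (not part of TSB).
# Polls a GatewayDeployment's status.phase until it reports READY.
# Usage: wait_for_ready <name> <namespace> [attempts] [interval-seconds]
wait_for_ready() {
  local name="$1" namespace="$2"
  local attempts="${3:-30}" interval="${4:-10}"
  local i=0 phase=""
  while [ "$i" -lt "$attempts" ]; do
    # Same query as the command above
    phase="$(kubectl get gatewaydeployment "$name" -n "$namespace" \
      -o jsonpath='{.status.phase}')"
    if [ "$phase" = "READY" ]; then
      echo "$namespace/$name is READY"
      return 0
    fi
    echo "$namespace/$name is ${phase:-unknown}; retrying in ${interval}s" >&2
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$namespace/$name did not reach READY after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_ready bookinfo-gw bookinfo 30 10` checks every 10 seconds for up to about 5 minutes and returns a non-zero status if the gateway never reaches READY, which makes it usable as a gate in a release script.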
Reference
Gateway phases
GatewayDeployment.status.phase reports the overall state of a gateway:
| Phase | Meaning |
|---|---|
| READY | The gateway is reconciled and the workload is healthy. |
| PENDING | The gateway is being created or its workload is still coming up. |
| WAITING_FOR_LOAD_BALANCER | The service has been created and is waiting for the cloud load balancer to be assigned. |
| RECONCILIATION_PAUSED | Reconciliation is paused for this gateway (by API setting or label). The workload continues running on its existing configuration. |
| RECONCILIATION_DIRTY | The operator detected pending changes for the gateway but the regular reconcile loop has not applied them. See Troubleshooting: a gateway is stuck in RECONCILIATION_DIRTY. |
| TRANSLATION_FAILED | TSB could not translate the gateway's configuration into Istio resources. |
Pause precedence
When multiple controls disagree, the operator resolves them in this order (highest to lowest):
- Object label on the gateway install CR (xcp.tetrate.io/gateway-reconcile: "true"|"false").
- Namespace override in gatewayReconciliation.namespaceOverrides, matched against the gateway's application namespace.
- Revision-level setting gatewayReconciliation.enabled.
- Default: reconciliation is enabled.
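
The precedence order above can be sketched as a small decision function. This is only an illustration of the documented rules, not the operator's actual code; resolve_reconcile is a hypothetical name, and the sketch assumes each input is exactly "true", "false", or empty when that control is not set. The reason strings are the ones listed under Pause reasons.

```shell
# resolve_reconcile: hypothetical illustration of the documented precedence.
# Arguments (each "true", "false", or "" when unset):
#   $1 value of the xcp.tetrate.io/gateway-reconcile label
#   $2 matching namespaceOverrides entry's enabled value
#   $3 revision-level gatewayReconciliation.enabled value
# Prints "enabled" or "paused <Reason>".
resolve_reconcile() {
  local label="${1:-}" ns_override="${2:-}" revision="${3:-}"
  if [ -n "$label" ]; then                 # 1. object label wins
    if [ "$label" = "false" ]; then echo "paused ObjectLabelDisabled"; else echo "enabled"; fi
  elif [ -n "$ns_override" ]; then         # 2. namespace override
    if [ "$ns_override" = "false" ]; then echo "paused NamespaceApiDisabled"; else echo "enabled"; fi
  elif [ -n "$revision" ]; then            # 3. revision-level setting
    if [ "$revision" = "false" ]; then echo "paused RevisionApiDisabled"; else echo "enabled"; fi
  else
    echo "enabled"                         # 4. default
  fi
}
```

For instance, `resolve_reconcile true false false` prints `enabled`, because the label overrides both API settings, while `resolve_reconcile "" "" false` prints `paused RevisionApiDisabled`.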
Pause reasons
When a gateway is paused, the ReconciliationPaused condition reports the reason:
| Reason | Source | How to release |
|---|---|---|
| ObjectLabelDisabled | The xcp.tetrate.io/gateway-reconcile=false label on the gateway install CR | Remove the label, or set it to true |
| NamespaceApiDisabled | A namespaceOverrides entry with enabled: false | Remove or update the override in the ControlPlane CR |
| RevisionApiDisabled | gatewayReconciliation.enabled: false at the revision level | Set enabled: true, or add a namespace override |
Inspect the condition on a gateway:

    status:
      phase: RECONCILIATION_PAUSED
      conditions:
        - type: ReconciliationPaused
          status: "True"
          reason: ObjectLabelDisabled
          message: 'Reconciliation paused: ObjectLabelDisabled'
          lastTransitionTime: "2026-03-23T16:45:51Z"

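
To read only the reason instead of scanning the whole status, a JSONPath filter on the condition type can be used. The following is a minimal sketch that assumes the condition is reported on the GatewayDeployment status as shown above; pause_reason_hint is a hypothetical helper, and the hints it prints simply restate the Pause reasons table.

```shell
# pause_reason_hint: hypothetical helper (not part of TSB).
# Looks up the ReconciliationPaused reason for one gateway and prints the
# matching release action from the Pause reasons table.
# Usage: pause_reason_hint <name> <namespace>
pause_reason_hint() {
  local name="$1" namespace="$2" reason
  reason="$(kubectl get gatewaydeployment "$name" -n "$namespace" \
    -o jsonpath='{.status.conditions[?(@.type=="ReconciliationPaused")].reason}')"
  case "$reason" in
    ObjectLabelDisabled)
      echo "$reason: remove the xcp.tetrate.io/gateway-reconcile label, or set it to true" ;;
    NamespaceApiDisabled)
      echo "$reason: remove or update the namespaceOverrides entry in the ControlPlane CR" ;;
    RevisionApiDisabled)
      echo "$reason: set gatewayReconciliation.enabled to true, or add a namespace override" ;;
    "")
      echo "no ReconciliationPaused reason reported" ;;
    *)
      echo "$reason: not one of the reasons listed on this page" ;;
  esac
}
```

For example, `pause_reason_hint bookinfo-gw bookinfo` prints the reason followed by the suggested release action.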
Observability
The edge operator exposes Prometheus metrics for gateway reconciliation. Scrape them from the operator service or port-forward locally:

    kubectl port-forward -n istio-system service/xcp-operator-edge 8084:8080
    curl -s http://localhost:8084/metrics | grep gateway_reconcile

All gateway metrics carry the labels gateway_type, gateway_namespace, and gateway_name. The most useful ones for upgrade workflows:
| Metric | Type | Description |
|---|---|---|
| gateway_reconcile_paused | Gauge | 1 when a gateway is currently paused, 0 when active. |
| gateway_reconcile_skipped_total | Counter | Incremented each time a gateway reconcile is skipped because of pause settings. The reason label is one of object_label_disabled, namespace_api_disabled, revision_api_disabled. |
| gateway_force_reconcile_total | Counter | Incremented each time the install.tetrate.io/reconcile-before label triggers a force-reconcile (used to unstick a RECONCILIATION_DIRTY gateway). |
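
For a quick check without a Prometheus server, the raw metrics output can be filtered directly. The following is a minimal sketch that lists the gateways whose gateway_reconcile_paused sample is 1. It assumes the standard Prometheus text exposition format without trailing timestamps; list_paused_gateways is a hypothetical helper name.

```shell
# list_paused_gateways: hypothetical helper (not part of TSB).
# Reads Prometheus text-format metrics on stdin and prints "namespace/name"
# for every gateway_reconcile_paused sample whose value is 1.
list_paused_gateways() {
  awk '
    /^gateway_reconcile_paused[{]/ && $NF == 1 {
      ns = ""; name = ""
      # Pull the label values out of gateway_namespace="..." and gateway_name="..."
      if (match($0, /gateway_namespace="[^"]*"/))
        ns = substr($0, RSTART + 19, RLENGTH - 20)
      if (match($0, /gateway_name="[^"]*"/))
        name = substr($0, RSTART + 14, RLENGTH - 15)
      print ns "/" name
    }
  '
}
```

With the port-forward above running, `curl -s http://localhost:8084/metrics | list_paused_gateways` prints one line per paused gateway.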
Example alert rules:
| Alert | PromQL | For | Severity |
|---|---|---|---|
| GatewayReconcilePausedTooLong | gateway_reconcile_paused == 1 | 7d | Warning: a pause was likely forgotten after an upgrade. |
| AllGatewaysFrozen | count(gateway_reconcile_paused == 1) == count(gateway_reconcile_paused) | 1h | Warning: every gateway is paused; no gateway is being actively managed. |