
Managing Gateways during Upgrades

When you upgrade your TSB ControlPlane, the new operator may regenerate gateway manifests with updated values — a new proxy image, new environment variables, a new injection template. Applying those changes restarts the gateway pods, and during a large upgrade many gateways restarting at once can mean unplanned downtime.

Starting in TSB 1.14.x, Platform Operators have three controls to manage this:

| Control | What it does |
| --- | --- |
| gatewayReconciliation in the ControlPlane CR | Pauses reconciliation for an entire revision, or for specific namespaces. |
| xcp.tetrate.io/gateway-reconcile label on a gateway install CR | Pauses (or explicitly enables) reconciliation for a single gateway. Highest precedence. |
| /debug/gateway-reconcile-diff endpoint on xcp-operator-edge | Dry-run preview of exactly what would change for each gateway if it were reconciled. |

Together they let you pause gateways before an upgrade, preview what the upgrade would change, and release each batch of gateways into a maintenance window of your choosing.

What "reconciliation" means here

A gateway is reconciled when the operator (re)applies its generated manifests. A reconcile only restarts the gateway's pods when the new manifests change the pod template — for example, a different proxy image or new environment variables.

Follow these steps for every cluster where you want to control when gateways restart.

  1. Pause gateway reconciliation before the upgrade

    You have two ways to pause reconciliation. Use whichever fits the scope you need:

    • The gatewayReconciliation setting in the ControlPlane CR is the bulk control. Pause an entire revision in one edit, and selectively allow specific namespaces through (for example, a staging namespace where you want to validate the upgrade first).
    • The xcp.tetrate.io/gateway-reconcile label on a single gateway is the surgical override. It always wins over the API setting — handy for protecting one business-critical gateway, or for un-pausing one gateway out of a namespace that is otherwise paused.

    The example below pauses every gateway under the default revision except those in envoy-staging and envoy-test:

    kubectl edit -n istio-system controlplane controlplane

    apiVersion: install.tetrate.io/v1alpha1
    kind: ControlPlane
    metadata:
      name: controlplane
      namespace: istio-system
    spec:
      xcp:
        isolationBoundaries:
        - name: global
          revisions:
          - name: default
            istio:
              tsbVersion: 1.14.1
            # Pause all gateways in this revision during the upgrade
            gatewayReconciliation:
              enabled: false
              namespaceOverrides:
              # Allow staging/test namespaces to reconcile for canary verification
              - namespace: envoy-staging
                enabled: true
              - namespace: envoy-test
                enabled: true

    To pause an individual gateway, label its install CR:

    Pause one gateway by name
    kubectl label gateways.install.tetrate.io -n bookinfo bookinfo-gw \
    xcp.tetrate.io/gateway-reconcile=false --overwrite

    Confirm the gateways are paused:

    List gateway phases
    kubectl get gatewaydeployments -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase'

    # Expected output: gateways covered by your pause settings show RECONCILIATION_PAUSED.
    # NAMESPACE NAME PHASE
    # bookinfo bookinfo-gw RECONCILIATION_PAUSED
    # istio-system tier1-gw RECONCILIATION_PAUSED
    # envoy-staging staging-gw READY
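
On clusters with many gateways, a count per phase can be easier to scan than the full listing. The following is a minimal sketch, assuming the standard sort and uniq utilities are available; count_phases is a hypothetical helper name, not part of TSB.

```shell
# count_phases: read one phase per line on stdin and print a count per phase,
# most frequent first. Hypothetical helper for illustration; not part of TSB.
count_phases() {
  sort | uniq -c | sort -rn
}

# Usage against a live cluster (not executed here):
#   kubectl get gatewaydeployments -A \
#     -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | count_phases
```

Before starting the upgrade, the RECONCILIATION_PAUSED count should match the number of gateways you intended to pause.
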
  2. Upgrade the ControlPlane

    Follow your normal TSB upgrade procedure. Gateways that are paused will continue running on their current configuration and will not be modified.

  3. Preview what the upgrade would change

    Before releasing any gateway, query the dry-run diff endpoint to see exactly what reconciliation would do. The endpoint is read-only — it never modifies cluster state.

    Port-forward to the edge operator:

    Port-forward to xcp-operator-edge
    kubectl port-forward -n istio-system deployment/xcp-operator-edge 8090:8090

    Then query the diff endpoint:

    http://localhost:8090/debug/gateway-reconcile-diff

    The response is a JSON document with one entry per gateway in the cluster. Each entry reports whether the gateway has pending changes (hasChanges) and whether applying them would restart the pods (willCauseRestart):

    {
      "gateways": [
        {
          "name": "bookinfo-gw",
          "namespace": "bookinfo",
          "kind": "GatewayDeployment",
          "revision": "default",
          "reconcileEnabled": false,
          "reconcileDisabledReason": "RevisionApiDisabled",
          "diff": {
            "deployment": {
              "hasChanges": true,
              "summary": "pod template changed (will cause restart): ...",
              "willCauseRestart": true
            },
            "service": { "hasChanges": false },
            "serviceAccount": { "hasChanges": false },
            "hpa": { "hasChanges": false }
          }
        }
      ],
      "summary": {
        "total": 47,
        "withChanges": 3,
        "willCauseRestart": 2,
        "paused": 45,
        "enabled": 2
      }
    }

    Use jq to quickly list the gateways that would restart:

    List gateways whose pods would restart on reconcile
    curl -s "http://localhost:8090/debug/gateway-reconcile-diff" \
    | jq -r '.gateways[]
    | select([.diff // {} | .[]? | select(. != null) | .willCauseRestart == true] | any)
    | "\(.namespace)/\(.name)"'

    # bookinfo/bookinfo-gw
    # istio-system/tier1-gw

    Or get a per-resource summary of the changes:

    Per-resource change summary
    curl -s "http://localhost:8090/debug/gateway-reconcile-diff" \
    | jq -r '.gateways[] | . as $g | (.diff // {}) | to_entries[]
    | select(.value != null and .value.hasChanges == true)
    | "\($g.namespace)/\($g.name)\t\(.key)\t\(.value.summary // "")"'
    Slow on large clusters?

    On clusters with many gateways, the cluster-wide diff can be slow to compute. Increase the worker concurrency with the concurrency query parameter:

    http://localhost:8090/debug/gateway-reconcile-diff?concurrency=5

    The endpoint is intended for debugging only. A cluster-wide diff can produce a very large response, so consider raising the memory limit on the xcp-operator-edge pod before requesting one.
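
Because each request recomputes the diff, it can help to save the response once and run every query against the saved file. The helpers below are a sketch under that assumption. They require jq, use only the fields shown in the sample response, and their names (diff_totals, paused_by_reason) are hypothetical.

```shell
# Hypothetical helpers for working with a saved copy of the diff response.
# Both read the JSON document on stdin and require jq.

# diff_totals: print the fields of the top-level "summary" object as key=value.
diff_totals() {
  jq -r '.summary | to_entries[] | "\(.key)=\(.value)"'
}

# paused_by_reason: count paused gateways per reconcileDisabledReason,
# printed as "<reason><TAB><count>".
paused_by_reason() {
  jq -r '[.gateways[] | select(.reconcileEnabled == false)
          | (.reconcileDisabledReason // "unknown")]
         | group_by(.)[] | "\(.[0])\t\(length)"'
}

# Usage (not executed here):
#   curl -s "http://localhost:8090/debug/gateway-reconcile-diff?concurrency=5" -o diff.json
#   diff_totals < diff.json
#   paused_by_reason < diff.json
```
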

  4. Release gateways during your maintenance window

    When you are ready to allow gateways to update, release them in batches so you can validate each batch before moving on.

    Release a namespace by adding (or updating) an override in the ControlPlane CR:

    gatewayReconciliation:
      enabled: false
      namespaceOverrides:
      - namespace: envoy-staging
        enabled: true
      - namespace: envoy-prod-us-east # newly released batch
        enabled: true

    Release a single gateway that you paused via the label by removing the label (the trailing - after the key tells kubectl to delete it):

    kubectl label gateways.install.tetrate.io -n bookinfo bookinfo-gw \
    xcp.tetrate.io/gateway-reconcile-

    Release everything by re-enabling reconciliation globally:

    gatewayReconciliation:
      enabled: true
      # namespaceOverrides no longer needed; can be removed

    Watch the gateway phases transition back to READY as each batch is released:

    kubectl get gatewaydeployments -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase'

    If any released gateway reports RECONCILIATION_DIRTY instead of READY, see Troubleshooting: a gateway is stuck in RECONCILIATION_DIRTY.
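
When releasing in batches, it can be convenient to block until a gateway reaches READY before moving to the next one. The following is a rough sketch, not a TSB tool: wait_for_phase is a hypothetical helper that polls the same status.phase field shown above, and the default timeout and interval are arbitrary choices.

```shell
# wait_for_phase NAME NAMESPACE [WANTED] [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Hypothetical helper: polls GatewayDeployment.status.phase until it equals
# WANTED (default READY). Returns 0 on success, 1 on timeout.
wait_for_phase() {
  name="$1"; namespace="$2"
  wanted="${3:-READY}"; timeout="${4:-300}"; interval="${5:-5}"
  elapsed=0
  while :; do
    phase="$(kubectl get gatewaydeployment "$name" -n "$namespace" \
      -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "$wanted" ]; then
      echo "$namespace/$name is $phase"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out: $namespace/$name is still ${phase:-unknown}" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

For example, wait_for_phase bookinfo-gw bookinfo waits up to five minutes for the gateway to report READY. A timeout is a signal to check whether the gateway is in RECONCILIATION_DIRTY.
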

  5. Confirm gateways are healthy and on the new proxy version

    After release, check the rollout completed and all pods are running. In the kubectl get deployments output, the READY column should show n/n (every replica ready); in kubectl get pods, every pod should be Running and 1/1.

    kubectl get deployments -n bookinfo
    kubectl get pods -n bookinfo

    Then confirm the istio-proxy container is running the image that ships with the new ControlPlane. Each gateway's deployment carries an app: tsb-gateway-... label whose exact value depends on the gateway and namespace — discover it with kubectl get deploy -n <namespace> --show-labels, then use it to select the pods:

    Show the running proxy image for a gateway
    kubectl get pods -n bookinfo -l app=tsb-gateway-bookinfo \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="istio-proxy")].image}{"\n"}{end}'

    # bookinfo-gw-7c8b9d5f4-abc12 docker.io/istio/proxyv2:1.24.4
    # bookinfo-gw-7c8b9d5f4-def34 docker.io/istio/proxyv2:1.24.4
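
To turn that listing into a pass/fail check, the output can be compared against the image you expect. A minimal sketch, assuming the tab-separated pod and image format produced by the jsonpath query above; check_proxy_images is a hypothetical helper name.

```shell
# check_proxy_images EXPECTED_IMAGE
# Reads "pod<TAB>image" lines on stdin, prints any pod that is not running
# EXPECTED_IMAGE, and returns non-zero if at least one such pod exists.
# Hypothetical helper for illustration; not part of TSB.
check_proxy_images() {
  awk -F '\t' -v want="$1" '
    $2 != want { print $1 " is running " $2; bad = 1 }
    END { exit bad }
  '
}

# Usage (not executed here):
#   kubectl get pods -n bookinfo -l app=tsb-gateway-bookinfo \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="istio-proxy")].image}{"\n"}{end}' \
#     | check_proxy_images docker.io/istio/proxyv2:1.24.4
```
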
    Gateways with no pod-template change won't restart on their own

    If a TSB upgrade does not change a gateway's pod template — for example, the proxy image tag is unchanged — the gateway will be marked reconciled but its pods will keep running the previous image until you restart them yourself.

    To pick up a new image on those gateways, do a rolling restart of the underlying Kubernetes Deployment during your maintenance window:

    Force-restart a gateway's pods
    kubectl rollout restart deployment <gateway-deployment-name> -n <namespace>
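
A restart returns immediately, so it is often paired with kubectl rollout status to wait for the new pods. The wrapper below is a sketch of that pairing; restart_gateway is a hypothetical name and the 5m default timeout is an arbitrary choice.

```shell
# restart_gateway DEPLOYMENT NAMESPACE [TIMEOUT]
# Hypothetical helper: triggers a rolling restart, then blocks until the
# rollout finishes or TIMEOUT (default 5m) expires.
restart_gateway() {
  deployment="$1"; namespace="$2"; timeout="${3:-5m}"
  kubectl rollout restart "deployment/$deployment" -n "$namespace" &&
    kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
}
```
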

Troubleshooting: a gateway is stuck in RECONCILIATION_DIRTY

A released gateway normally moves from RECONCILIATION_PAUSED back to READY once the operator reconciles it. RECONCILIATION_DIRTY means the operator detected pending changes for the gateway but the regular reconcile loop has not applied them — the gateway keeps running on its existing configuration in the meantime.

Inspect the DirtyStateDetected condition to see which resources differ:

kubectl describe gatewaydeployment <name> -n <namespace>

To apply the pending changes, force-reconcile the gateway by setting the install.tetrate.io/reconcile-before label to a near-future timestamp. The value is a compact ISO 8601 UTC timestamp in the format YYYYMMDDThhmmssZ — for example, 20260601T120000Z means 1 June 2026, 12:00 UTC.

Force a reconcile within the next hour
kubectl label gateways.install.tetrate.io <name> -n <namespace> \
install.tetrate.io/reconcile-before=20260601T120000Z --overwrite

While the current time is before the timestamp, the operator performs a full reconcile and restarts the pods if the pod template changed. Once the timestamp passes, the label has no further effect — it is safe to leave in place, including in GitOps-managed manifests, and does not affect future upgrades.
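
Rather than typing the timestamp by hand, it can be generated from the current time. A minimal sketch, assuming GNU date (the -d option); reconcile_before_timestamp is a hypothetical helper name.

```shell
# reconcile_before_timestamp [OFFSET]
# Prints a UTC timestamp in YYYYMMDDThhmmssZ form, OFFSET from now
# (default "+1 hour"). Assumes GNU date; on BSD/macOS the rough equivalent
# is: date -u -v+1H '+%Y%m%dT%H%M%SZ'
reconcile_before_timestamp() {
  date -u -d "${1:-+1 hour}" '+%Y%m%dT%H%M%SZ'
}

# Usage (not executed here):
#   kubectl label gateways.install.tetrate.io <name> -n <namespace> \
#     install.tetrate.io/reconcile-before="$(reconcile_before_timestamp)" --overwrite
```
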

Confirm the gateway is back to READY:

kubectl get gatewaydeployment <name> -n <namespace> -o jsonpath='{.status.phase}'

Reference

Gateway phases

GatewayDeployment.status.phase reports the overall state of a gateway:

| Phase | Meaning |
| --- | --- |
| READY | The gateway is reconciled and the workload is healthy. |
| PENDING | The gateway is being created or its workload is still coming up. |
| WAITING_FOR_LOAD_BALANCER | The service has been created and is waiting for the cloud load balancer to be assigned. |
| RECONCILIATION_PAUSED | Reconciliation is paused for this gateway (by API setting or label). The workload continues running on its existing configuration. |
| RECONCILIATION_DIRTY | The operator detected pending changes for the gateway but the regular reconcile loop has not applied them. See Troubleshooting: a gateway is stuck in RECONCILIATION_DIRTY. |
| TRANSLATION_FAILED | TSB could not translate the gateway's configuration into Istio resources. |

Pause precedence

When multiple controls disagree, the operator resolves them in this order (highest to lowest):

  1. Object label on the gateway install CR (xcp.tetrate.io/gateway-reconcile: "true"|"false").
  2. Namespace override in gatewayReconciliation.namespaceOverrides, matched against the gateway's application namespace.
  3. Revision-level setting gatewayReconciliation.enabled.
  4. Default: reconciliation is enabled.
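
The order above can be read as a simple first-match rule. The function below is an illustrative sketch of that rule only; it is not the operator's code, and effective_reconcile and its output format are hypothetical.

```shell
# effective_reconcile LABEL NS_OVERRIDE REVISION_SETTING
# Each argument is "true", "false", or "" (unset). Prints the effective
# setting followed by the level that decided it. Illustrative only: this
# mirrors the documented order, it is not the operator's implementation.
effective_reconcile() {
  if [ -n "$1" ]; then
    echo "$1 label"
  elif [ -n "$2" ]; then
    echo "$2 namespace-override"
  elif [ -n "$3" ]; then
    echo "$3 revision"
  else
    echo "true default"
  fi
}
```

For example, effective_reconcile true false false prints "true label": the label un-pauses a gateway even though both its namespace and its revision are paused.
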

Pause reasons

When a gateway is paused, the ReconciliationPaused condition reports the reason:

| Reason | Source | How to release |
| --- | --- | --- |
| ObjectLabelDisabled | The xcp.tetrate.io/gateway-reconcile=false label on the gateway install CR | Remove the label, or set it to true |
| NamespaceApiDisabled | A namespaceOverrides entry with enabled: false | Remove or update the override in the ControlPlane CR |
| RevisionApiDisabled | gatewayReconciliation.enabled: false at the revision level | Set enabled: true, or add a namespace override |

Inspect the condition on a gateway:

kubectl get gatewaydeployment <name> -n <namespace> -o yaml

status:
  phase: RECONCILIATION_PAUSED
  conditions:
  - type: ReconciliationPaused
    status: "True"
    reason: ObjectLabelDisabled
    message: 'Reconciliation paused: ObjectLabelDisabled'
    lastTransitionTime: "2026-03-23T16:45:51Z"
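
To read only the reason instead of the full YAML, a kubectl JSONPath filter on the condition type can be used. A small sketch, assuming the condition layout shown above; pause_reason is a hypothetical helper name.

```shell
# pause_reason NAME NAMESPACE
# Hypothetical helper: prints the reason of the ReconciliationPaused
# condition, or nothing if the condition is absent.
pause_reason() {
  kubectl get gatewaydeployment "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="ReconciliationPaused")].reason}'
}

# Usage (not executed here):
#   pause_reason bookinfo-gw bookinfo
```
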

Observability

The edge operator exposes Prometheus metrics for gateway reconciliation. Scrape them from the operator service or port-forward locally:

kubectl port-forward -n istio-system service/xcp-operator-edge 8084:8080
curl -s http://localhost:8084/metrics | grep gateway_reconcile

All gateway metrics carry the labels gateway_type, gateway_namespace, and gateway_name. The most useful ones for upgrade workflows:

| Metric | Type | Description |
| --- | --- | --- |
| gateway_reconcile_paused | Gauge | 1 when a gateway is currently paused, 0 when active. |
| gateway_reconcile_skipped_total | Counter | Incremented each time a gateway reconcile is skipped because of pause settings. The reason label is one of object_label_disabled, namespace_api_disabled, revision_api_disabled. |
| gateway_force_reconcile_total | Counter | Incremented each time the install.tetrate.io/reconcile-before label triggers a force-reconcile (used to unstick a RECONCILIATION_DIRTY gateway). |
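
Without a Prometheus server at hand, the raw /metrics output can be filtered directly to see which gateways are still paused. A minimal sketch, assuming the standard Prometheus text format with no trailing timestamp on each sample; paused_series is a hypothetical helper name.

```shell
# paused_series: read Prometheus text-format metrics on stdin and print the
# gateway_reconcile_paused series whose value is 1. Assumes the value is the
# last whitespace-separated field (no trailing timestamp).
paused_series() {
  awk '/^gateway_reconcile_paused[{ ]/ && $NF == 1'
}

# Usage (not executed here):
#   curl -s http://localhost:8084/metrics | paused_series
```
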

Example alert rules:

| Alert | PromQL | For | Severity |
| --- | --- | --- | --- |
| GatewayReconcilePausedTooLong | gateway_reconcile_paused == 1 | 7d | Warning: a pause was likely forgotten after an upgrade. |
| AllGatewaysFrozen | count(gateway_reconcile_paused == 1) == count(gateway_reconcile_paused) | 1h | Warning: every gateway is paused; no gateway is being actively managed. |
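
The two alerts can be expressed as a Prometheus rule file. The sketch below writes one from the shell. The group name, the lowercase severity label values, and the annotation text are assumptions for illustration, and write_gateway_alert_rules is a hypothetical helper name. Adapt the output to however your Prometheus loads rules.

```shell
# write_gateway_alert_rules FILE
# Writes the two example alerts as a Prometheus rule file. The group name and
# the severity label values are assumptions, not values mandated by TSB.
write_gateway_alert_rules() {
  cat > "$1" <<'EOF'
groups:
- name: tsb-gateway-reconcile
  rules:
  - alert: GatewayReconcilePausedTooLong
    expr: gateway_reconcile_paused == 1
    for: 7d
    labels:
      severity: warning
    annotations:
      summary: A gateway pause was likely forgotten after an upgrade.
  - alert: AllGatewaysFrozen
    expr: count(gateway_reconcile_paused == 1) == count(gateway_reconcile_paused)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: Every gateway is paused; no gateway is being actively managed.
EOF
}

# Usage (not executed here):
#   write_gateway_alert_rules gateway-alerts.yaml
#   promtool check rules gateway-alerts.yaml
```
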