Graceful Connection Drain of istio-proxy
This document explains what happens when a pod with an istio-proxy sidecar is deleted, in particular how its connections are treated, and how you can configure the sidecar to drain inflight connections gracefully.
This document only applies to TSB version <= 1.4.x.
Before you get started, make sure you:
✓ Familiarize yourself with TSB concepts
✓ Install the TSB environment. You can use the TSB demo for a quick install
✓ Complete the TSB usage quickstart. This document assumes you have already created a Tenant and are familiar with Workspaces and Config Groups. You also need to configure tctl for your TSB environment
✓ Install httpbin
When you issue a delete request against a pod in your Kubernetes cluster, all containers within the pod are sent a SIGTERM. If the pod contains only a single container, it will receive a SIGTERM and go into the terminating state.
However, if the pod contains a sidecar (in our case an istio-proxy sidecar), then it is not automatically guaranteed that the main application is terminated before the sidecar.
If the istio-proxy sidecar is terminated before the application, the following issues may occur:
- All TCP connections (both inbound and outbound) are terminated abruptly.
- Any connections from the application fail.
While there is a proposed KEP for it, currently there is no straightforward way to tell Kubernetes to terminate the application before the sidecar.
However, it is possible to work around this problem by configuring the drainDuration parameter. This configuration parameter controls the amount of time that the underlying envoy proxy drains inflight connections before fully terminating.
To take advantage of the drainDuration parameter, you will need to configure it in both the sidecar containers and the TSB gateways.
Configuring drainDuration time for istio-proxy containers
You will need to apply an overlay to the ControlPlane CR or Helm values to set drainDuration. Consider the following example. Note that only the applicable parts are shown; you will most likely need much more configuration for your control plane.
spec:
  ...
  components:
    istio:
      kubeSpec:
        overlays:
          - apiVersion: install.istio.io/v1alpha1
            kind: IstioOperator
            name: tsb-istiocontrolplane
            patches:
              - path: spec.meshConfig.defaultConfig.drainDuration
                value: 50s
  ...
After adding the overlay to your configuration, use the kubectl command to apply it to the ControlPlane CR:
kubectl apply -f controlplane.yaml
If you use Helm, you can update the spec section of the control plane Helm values and then run helm upgrade.
Verifying the drainDuration
You must restart the workload that runs the istio-proxy sidecar for the drainDuration to take effect. Once you have restarted your workload, you can verify the setting by checking the envoy config dump:
kubectl exec helloworld-v1-59fdd6b476-pjrtr -n helloworld -c istio-proxy -- pilot-agent request GET config_dump |grep -i drainDuration
"drainDuration": "50s",
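To check the value from a script rather than by eye, the field can be pulled out of the config dump with a small filter. The following is a minimal sketch, not a TSB or Istio tool: the function name extract_drain_duration is hypothetical, and it assumes the dump prints the field on a single line in the form shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of TSB or Istio): reads a config dump on
# stdin and prints the value of the first "drainDuration" field it finds.
extract_drain_duration() {
  sed -n '/"drainDuration"/{
    s/.*"drainDuration": *"\([^"]*\)".*/\1/p
    q
  }'
}
```

Piping the output of the kubectl exec command above into extract_drain_duration should print only the configured value, which a script can then compare against the value set in the overlay.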
Configuring drainDuration for TSB gateways
If you are using TSB gateways such as IngressGateway, EgressGateway, or Tier1Gateway, you will need to set the connectionDrainDuration parameter on the appropriate gateway type.
You can query the current value for the connectionDrainDuration field on your gateway custom resource by issuing the following command:
kubectl get ingress helloworld-gateway -n helloworld -oyaml | grep connectionDrainDuration:
connectionDrainDuration: 22s
The following example shows how connectionDrainDuration may be set. Please read the spec for further information on this field.
apiVersion: install.tetrate.io/v1alpha1
kind: IngressGateway
metadata:
  name: helloworld-gateway
spec:
  connectionDrainDuration: 10s
  # ... <snip> ...
Verifying the drainDuration in the TSB Gateway
To check the value for drainDuration that is being set on the pod, you can query the environment variable:
kubectl describe po helloworld-gateway-7d5d4c8d57-msfd6 -n helloworld | grep -i DRAIN
TERMINATION_DRAIN_DURATION_SECONDS: 22
You can also verify this value by looking at the logs for the gateway pod when you terminate the gateway. If you watch the logs as the gateway pod is terminated, you should see messages resembling the following:
2022-03-29T06:02:50.423789Z info Graceful termination period is 22s, starting...
2022-03-29T06:03:12.423988Z info Graceful termination period complete, terminating remaining proxies.
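The two timestamps can also be subtracted to confirm that the gateway waited for the configured period. The sketch below is a hypothetical helper, not a TSB tool. It assumes timestamps in the format shown in the log lines above, ignores fractional seconds, and assumes the two lines are less than 24 hours apart.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two log timestamps such as
# 2022-03-29T06:02:50.423789Z. Only the HH:MM:SS part is used.

# Print the time of day of a timestamp as whole seconds since midnight.
to_seconds() {
  t=${1#*T}      # drop the date part:        06:02:50.423789Z
  t=${t%Z}       # drop the trailing Z:       06:02:50.423789
  t=${t%%.*}     # drop fractional seconds:   06:02:50
  h=${t%%:*}
  rest=${t#*:}
  m=${rest%%:*}
  s=${rest#*:}
  # Strip one leading zero so values like "08" are not read as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the whole seconds elapsed between a start and an end timestamp.
elapsed_seconds() {
  start=$(to_seconds "$1")
  end=$(to_seconds "$2")
  delta=$(( end - start ))
  # A negative difference means the interval crossed midnight.
  if [ "$delta" -lt 0 ]; then
    delta=$(( delta + 86400 ))
  fi
  echo "$delta"
}

elapsed_seconds 2022-03-29T06:02:50.423789Z 2022-03-29T06:03:12.423988Z
```

For the two log lines above this prints 22, which matches the 22-second graceful termination period reported in the first line.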