Version: 0.9.x

TSB HA and DR in AWS

This document outlines a base setup for TSB HA and DR when installed in AWS. In this test scenario we use an AWS RDS instance as the configuration store and the AWS Elasticsearch service as the metrics store.

Multi-Availability Zone

The environment setup for this scenario is as follows:

  • RDS Postgres instance configured as multi-availability zone.
  • EKS cluster for the management plane spanning 3 availability zones, with 3 node pools, each pool in a different availability zone.
  • AWS Elasticsearch domain configured across multiple availability zones in VPC mode, bound to 3 subnets, each in a different availability zone.
  • TSB connected to both the RDS Postgres instance and the Elasticsearch domain.

AWS RDS achieves instance failover via DNS resolution. AWS provides a DNS record that resolves to the active instance; when failover occurs, the record is updated to resolve to the new active instance's IP address, so connections need to be recreated. In TSB this currently takes less than 1 minute.

AWS Elasticsearch is deployed with an ELB in front of it. Attaching the ES instances (and hence the ELB that provides access to them) to 3 subnets in 3 availability zones provides redundancy: if an entire availability zone goes down, the ELB can still reach the other subnets.

TSB

To test failover for the RDS instance, AWS provides a reboot action with an option that forces the failover to a different availability zone. When this happens you can see something similar to the following in the tsb server logs:

2020-02-24T13:00:11.358917Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-02-24T13:00:44.442867Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": read tcp 10.0.0.185:37000->10.0.0.98:5432: read: connection reset by peer
...
2020-02-24T13:01:02.875262Z warn grpc error: rpc error: code = Internal desc = error listing environments: read tcp 10.0.0.185:43380->10.0.0.98:5432: read: connection reset by peer
2020-02-24T13:01:02.875262Z warn grpc error: rpc error: code = Internal desc = error listing objects under environment: read tcp 10.0.0.185:49198->10.0.0.98:5432: read: connection reset by peer
2020-02-24T13:01:12.961913Z info ...
...
2020-02-24T13:02:11.388746Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-02-24T13:02:12.826606Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-02-24T13:02:22.746764Z warn grpc error: rpc error: code = Internal desc = error executing insert audit log statement: read tcp 10.0.0.185:40790->10.0.0.98:5432: read: connection reset by peer
2020-02-24T13:02:22.746932Z warn audit error dispatching audit log: rpc error: code = Internal desc = error executing insert audit log statement: read tcp 10.0.0.185:40790->10.0.0.98:5432: read: connection reset by peer
2020-02-24T13:02:22.783330Z info transport: http2Server.HandleStreams failed to read frame: read tcp 10.0.0.185:9080->10.0.1.17:34158: read: connection reset by peer
2020-02-24T13:02:22.783409Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-02-24T13:02:25.845679Z info Failed to extract ServerMetadata from context
2020-02-24T13:02:30.938885Z warn q test(tenants/chirauki-tenant/users/tsbd-tcc-dev, tenants/chirauki-tenant/environments/dev) = Error(error processing user graph: error processing U(users:tenants/chirauki-tenant/users/tsbd-tcc-dev): error processing candidate node "bind([users:tenants/chirauki-tenant/users/tsbd-tcc-dev][[]] -(role:rbac/admin)-> config:tenants/chirauki-tenant/environments/dev/clusters/tcc-dev/namespaces/bookinfo/deployments/tsb-gateway-bookinfo)": error processing candidate node "rbac-bind([users:tenants/chirauki-tenant/users/tsbd-tcc-dev][[]] -(role:rbac/admin)-> config:tenants/chirauki-tenant/environments/dev/clusters/tcc-dev/namespaces/bookinfo/deployments/tsb-gateway-bookinfo)": error processing node "rbac-bind([users:tenants/chirauki-tenant/users/tsbd-tcc-dev][[]] -(role:rbac/admin)-> config:tenants/chirauki-tenant/environments/dev/clusters/tcc-dev/namespaces/bookinfo/deployments/tsb-gateway-bookinfo)": error processing node "rbac-bind([users:tenants/chirauki-tenant/users/tsbd-tcc-dev][[]] -(role:rbac/admin)-> config:tenants/chirauki-tenant/environments/dev/clusters/tcc-dev/namespaces/bookinfo/deployments/tsb-gateway-bookinfo)": error getting associations of UA(rbac-bind([users:tenants/chirauki-tenant/users/tsbd-tcc-dev][[]] -(role:rbac/admin)-> config:tenants/chirauki-tenant/environments/dev/clusters/tcc-dev/namespaces/bookinfo/deployments/tsb-gateway-bookinfo)): error getting associations for "rbac-bind([users:tenants/chirauki-tenant/users/tsbd-tcc-dev][[]] -(role:rbac/admin)-> config:tenants/chirauki-tenant/environments/dev/clusters/tcc-dev/namespaces/bookinfo/deployments/tsb-gateway-bookinfo)": read tcp 10.0.0.185:35890->10.0.0.98:5432: read: connection reset by peer)
2020-02-24T13:02:30.938937Z warn grpc error: rpc error: code = PermissionDenied desc = access denied
2020-02-24T13:02:37.048641Z info Failed to extract ServerMetadata from context
2020-02-24T13:02:39.130545Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:02:41.178484Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:02:43.230475Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:02:43.230485Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:02:45.274539Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:02:45.274614Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:02:45.744783Z info Failed to extract ServerMetadata from context
2020-02-24T13:02:47.326497Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:02:51.418478Z warn grpc error: error finding user for principal "tsbd-tcc-dev": error looking for user "tsbd-tcc-dev": error getting node "users:tsbd-tcc-dev": error getting node with name "users:tsbd-tcc-dev": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:02:53.047410Z info Failed to extract ServerMetadata from context
2020-02-24T13:03:01.044368Z info Failed to extract ServerMetadata from context
2020-02-24T13:03:05.754541Z warn grpc error: error finding user for principal "tsbd-tcc-dev": error looking for user "tsbd-tcc-dev": error getting node "users:tsbd-tcc-dev": error getting node with name "users:tsbd-tcc-dev": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:03:07.802504Z warn grpc error: error finding user for principal "admin": error looking for user "admin": error getting node "users:admin": error getting node with name "users:admin": dial tcp 10.0.2.31:5432: connect: connection timed out
2020-02-24T13:03:09.047788Z info Failed to extract ServerMetadata from context

In the above output, 10.0.2.31 is the original instance IP, whereas 10.0.0.98 is the failed-over instance IP; the TCC pod IP is 10.0.0.185. TSB self-heals within 1 minute.
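The forced failover described above can be triggered from the AWS CLI; a minimal sketch, assuming a hypothetical instance identifier tsb-postgres:

# Reboot the instance and force failover to the standby in another AZ.
aws rds reboot-db-instance \
  --db-instance-identifier tsb-postgres \
  --force-failover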

TSBD

tsbd uses a persistent volume to store configuration; in EKS these are provisioned as EBS volumes by default. The EBS volume is created in the availability zone where the tsbd pod is first scheduled, and since AWS does not allow mounting a volume to an instance in a different availability zone than the one where the EBS volume lives, EKS cannot reschedule tsbd to a different availability zone should an availability zone failure occur.

AWS EFS, on the other hand, is available across availability zones, and AWS provides documentation on how to use EFS to back EKS persistent volumes.

For the configsink-data volume to be provisioned in the AWS EFS storage class, add the following annotation to the configsync PVC:

volume.beta.kubernetes.io/storage-class: aws-efs

Where aws-efs is the name of the storage class that provisions volumes in EFS.
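As an illustration, a minimal sketch of such a storage class and annotated PVC is shown below. The provisioner name, PVC name and size are assumptions and depend on how the EFS provisioner was deployed; refer to the AWS documentation mentioned above for the actual setup.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-efs
provisioner: example.com/aws-efs  # hypothetical; set by your EFS provisioner deployment
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: configsink-data  # name as referenced above
  namespace: tcc
  annotations:
    volume.beta.kubernetes.io/storage-class: aws-efs
spec:
  accessModes:
  - ReadWriteMany  # EFS supports cross-AZ ReadWriteMany access
  resources:
    requests:
      storage: 1Gi  # hypothetical size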

Disaster recovery - Multi Region

Postgres DR

Postgres DR is achieved by having an active master instance in the main region and a read replica of that instance in the secondary region. All writes to the master instance are replicated to the read replica.

A DNS CNAME record is set up so that a single name points to the Postgres RDS endpoint of the master instance. If the master instance fails, the recovery procedure is:

  • Promote the read replica in the secondary region to master. At this point, the replica becomes a master and stops receiving updates from the instance in the main region.
  • Update the CNAME in DNS to resolve to the Postgres RDS endpoint of the instance that has just been promoted to master.
  • TSB's active connections to the database will fail, and new connections will be created against the promoted instance (steps 1 and 2 are sketched below).
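A sketch of the first two steps with the AWS CLI, using a hypothetical replica identifier, hosted zone ID and record name:

# Promote the read replica in the secondary region to master.
aws rds promote-read-replica \
  --db-instance-identifier tsb-postgres-replica \
  --region eu-west-1

# Point the CNAME at the endpoint of the promoted instance.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "postgres.tsb.internal.",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "tsb-postgres-replica.example.eu-west-1.rds.amazonaws.com"}]
      }
    }]
  }'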

Elasticsearch DR

The DNS CNAME approach cannot be used between AWS Elasticsearch and the OAP and Zipkin components in the management plane (that is, the tcc namespace).

The management plane OAP and Zipkin are Java applications. An init container in the pod fetches the CA certificate used to validate the HTTPS certificate presented by the Elasticsearch endpoint, but if we use a CNAME record, the host name used in the connection will not match the CN of the Elasticsearch certificate and connections will fail. As such, we will need to configure each TSB control plane to point to the local region's Elasticsearch.
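The mismatch can be seen by inspecting the certificate the endpoint presents; a sketch with openssl, using a hypothetical Elasticsearch endpoint and CNAME:

# The certificate subject carries the real VPC endpoint host name, not the
# CNAME passed as SNI, so Java host name verification fails.
openssl s_client \
  -connect vpc-tsb-es-example.eu-west-3.es.amazonaws.com:443 \
  -servername elastic.tsb.internal </dev/null 2>/dev/null \
  | openssl x509 -noout -subject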

We will set up a CronJob in the main region EKS cluster that periodically takes a snapshot of the Elasticsearch domain in the main region to a replicated S3 bucket, as described in the appendix. We will also set up a CronJob in the secondary region EKS cluster that restores the snapshot taken in the main region into the secondary region Elasticsearch domain.

Because AWS Elasticsearch does not allow the close operation on indices, the oap-deployment and Zipkin deployments in the tcc namespace in the secondary region EKS cluster have to be scaled down, and scaled up only when failing over.
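For example, to keep them scaled down until a failover (assuming the Zipkin deployment is named zipkin):

kubectl -n tcc scale deployment oap-deployment zipkin --replicas=0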

Multiple TSB management planes

The scenario setup for this test is as follows:

  • “Main” region is eu-west-3, “secondary” is eu-west-1.
  • RDS Postgres instance in main region.
  • RDS read replica from the above instance in the secondary region.
  • One EKS cluster in each region.
  • One Elasticsearch domain in each region.
  • Route53 private DNS zone holding 2 CNAME records:
      • A record pointing to the TSB front envoy LB in the main region (this can't be set up until the TSB management plane has been deployed).
      • A record pointing to the master RDS instance in the main region.
  • S3 bucket in region eu-west-3 replicating to another bucket in eu-west-1. This will be used to store Elasticsearch snapshots.
  • TSB management plane deployed into both EKS clusters, connecting to the RDS instance using the Route53 CNAME record.
  • TSB control plane deployed in both EKS clusters, connecting to TSB management plane via Route53 CNAME.

Note that this setup requires VPC peering or equivalent to allow communication between the 2 VPCs.

TSB setup

Deploy TSB in the main region, disabling the bundled Postgres and supplying the CNAME for the RDS master instance as the Postgres host.

After installing TSB, update the CNAME record for the TSB API to point to the front envoy load balancer in the main region. We recommend setting a short TTL so that when the record is updated to point to the failed-over instance, clients pick up the change quickly.

Deploy TSB in the secondary region, disabling bundled Postgres and supplying the CNAME for the RDS master instance as Postgres host. Onboard the EKS cluster in the secondary region in TSB.

At this point you have a working TSB management plane with the EKS clusters of both the main and secondary regions onboarded.

Failover

With the described scenario, the failover procedure is as follows.

  1. Delete master RDS instance in the main region (if available).
  2. Promote read RDS replica in secondary region to master.
  3. Update the DNS CNAME for the TSB API to point to the new front envoy service in the secondary region cluster, and the CNAME for Postgres to the promoted RDS instance in the secondary region. If only the EKS cluster failed, update only the TSB CNAME; if only the DB failed, update only the Postgres CNAME.
  4. Delete the CronJob that restores Elasticsearch snapshots.
  5. Scale up the oap-deployment and Zipkin deployments in the tcc namespace in the secondary region EKS cluster (see the sketch below).
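Steps 4 and 5 might look as follows with kubectl, using the CronJob name from the appendix and assuming the Zipkin deployment is named zipkin:

# Step 4: stop restoring snapshots in the secondary region.
kubectl -n tcc delete cronjob elasticsearch-restore

# Step 5: bring up the metrics components.
kubectl -n tcc scale deployment oap-deployment zipkin --replicas=1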

Failback

Once the main region is back online, the failback procedure should be as follows.

  1. Delete RDS instance in the main region (if present).
  2. Depending on the state of the EKS cluster when the main region becomes available, you may need to deploy the TSB management plane again. Keep oap-deployment and Zipkin scaled to 0.
  3. Set up a read replica of the current RDS instance in the main region.
  4. Set up the Elasticsearch backup job in the secondary region EKS cluster to take a snapshot of the Elasticsearch cluster in that region.
  5. Set up the Elasticsearch snapshot restore job in the main region EKS cluster to restore the Elasticsearch data.
  6. Promote the RDS instance in the main region to be a master instance.
  7. Update DNS CNAME for TSB API to point to the new front envoy service in the main region cluster, and the CNAME for Postgres to the new RDS instance in the main region.
  8. Delete RDS instance in secondary region and set up a new read replica from RDS instance in the main region.
  9. Delete the restore job in the main EKS cluster and the backup job in the secondary EKS cluster.

Single TSB management plane

In this case, the second TSB management plane is deployed after the failure in the primary region, and a backup-based approach is used to restore its state.

The scenario setup for this test is as follows:

  • RDS Postgres instance in region eu-west-3.
  • RDS read replica from the above instance in eu-west-1.
  • One EKS cluster in each region (eu-west-3, eu-west-1).
  • One Elasticsearch domain in each region (eu-west-3, eu-west-1).
  • TSB management plane deployed into EKS cluster at eu-west-3, connecting to the RDS instance.
  • Route53 private DNS zone holding 2 CNAME records:
      • A record pointing to the TSB front envoy LB.
      • A record pointing to the master RDS instance.
  • S3 bucket in region eu-west-3 replicating to another bucket in eu-west-1. This will be used to store K8S backups and Elasticsearch snapshots.

TSB setup

Once the TSB management plane is deployed, update or create the DNS record that points to the front envoy load balancer. We recommend setting a short TTL so that when the record is updated to point to the failed-over instance, clients pick up the change quickly.

In this scenario you will need a way to replicate the Kubernetes objects that form the TSB management plane to the secondary region EKS cluster. There are many third-party tools for this; in this document we describe how to achieve it using Velero.

Install Velero in both clusters:

velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.1 \
--bucket ttr8-bck-test \
--backup-location-config region=eu-west-3 \
--snapshot-location-config region=eu-west-1 \
--no-secret

And:

velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.1 \
--bucket ttr8-bck-test-repl \
--backup-location-config region=eu-west-1 \
--snapshot-location-config region=eu-west-3 \
--no-secret

Backups in region eu-west-3 will be uploaded to bucket ttr8-bck-test, which replicates its contents to bucket ttr8-bck-test-repl in region eu-west-1. This makes the contents of the backup available even if the whole eu-west-3 region fails. Similarly, Velero takes snapshots of the EBS volumes backing persistent volumes in the region specified in --snapshot-location-config.

Due to how AWS EKS IAM permissions are granted to pods, a couple of extra steps are needed for Velero to work correctly: we grant the Velero pods access to AWS resources by mapping the Kubernetes service account to an IAM role.

kubectl annotate serviceaccounts -n velero velero \
eks.amazonaws.com/role-arn=[your-s3-access-role-arn]
kubectl patch -n velero deployments/velero --patch \
'{"spec":{"template":{"spec":{"securityContext":{"fsGroup":65535}}}}}'

Once installed, in the “main” cluster:

velero backup create tsb --include-namespaces tcc --include-cluster-resources

Cluster-scoped resources are needed since some cluster roles are in use. You can verify the progress and status of the backup:

velero backup describe tsb
Name: tsb
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: <none>

Phase: Completed
...

Failover

The failover procedure in case of region failure is described below.

  1. Delete master RDS instance (if available).
  2. Promote read RDS replica to master.
  3. Restore backup in EKS cluster in the secondary region.
  4. Update DNS CNAME to point to the new front envoy service in “secondary” cluster, and promoted RDS instance.

To restore the backup into the secondary cluster:

velero restore create tsb-dr --from-backup tsb --include-cluster-resources

Wait for the restore to finish.

velero restore describe tsb-dr
Name: tsb-dr
Namespace: velero
Labels: <none>
Annotations: <none>

Phase: Completed
...

Once the restore has completed, you can check the new front envoy service address in the tcc namespace. Update the Route53 CNAME for the TSB API. You will also need to update the CNAME for the RDS instance to point to the promoted instance.

Update the ES_HOSTS environment variable in the Zipkin deployment (both containers) in the tcc namespace, and the SW_STORAGE_ES_CLUSTER_NODES environment variable in the oap-deployment deployment in the tcc namespace.
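A sketch of these updates with kubectl, assuming a front envoy service named front-envoy, a Zipkin deployment named zipkin, and a hypothetical secondary region endpoint:

# Look up the new front envoy load balancer address for the Route53 update
# (hypothetical service name).
kubectl -n tcc get svc front-envoy \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# Point Zipkin (all containers) and OAP at the secondary region Elasticsearch.
kubectl -n tcc set env deployment/zipkin --containers='*' \
  ES_HOSTS=https://vpc-tsb-es-example.eu-west-1.es.amazonaws.com
kubectl -n tcc set env deployment/oap-deployment \
  SW_STORAGE_ES_CLUSTER_NODES=vpc-tsb-es-example.eu-west-1.es.amazonaws.com:443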

Failover time based on the above procedure: less than 30 minutes.

Failback

Once the main region is back online, the failback procedure is as described below.

  1. Delete RDS instance in the main region (if present).
  2. Delete the tcc namespace in the original EKS cluster (if present).
  3. Set up a read replica of the current RDS instance in the main region.
  4. Take a Velero backup of the running TSB instance.
  5. Once the new RDS read replica is ready, promote it to be a master instance.
  6. Set up the Elasticsearch backup job in the secondary region EKS cluster to take a snapshot of the Elasticsearch cluster in that region.
  7. Set up the Elasticsearch snapshot restore job in the main region EKS cluster to restore the Elasticsearch data.
  8. Restore backup in a new EKS cluster in the main region.
  9. Update both CNAME records to the new front envoy service and new RDS instance.
  10. Delete the restore job in the main EKS cluster and the backup job in the secondary EKS cluster.

Appendix

Sizing

See below for the sizing used for each component in the tests described.

RDS

  • Amazon RDS for Postgres.
  • Instance class ​db.t2.micro​.
  • Multi-availability zone setup using a DB subnet group spread across 3 subnets, each of them in a different availability zone.
  • Backups enabled with 1 day retention (required when setting up a read replica in the DR scenario).
  • Any setting not listed above uses default AWS settings.

EKS

  • Control plane spread across 3 subnets, each of them in a different availability zone.
  • 3 node pools, each of them in a different availability zone. Node pool instance size t3.large.
  • Any setting not listed above uses default AWS settings.

Elasticsearch

  • Elasticsearch domain configured as multi-availability-zone aware across 3 availability zones, using 3 subnets, each of them in a different availability zone.
  • 3 dedicated master instances with instance type ​m5.large.elasticsearch​.
  • 3 dedicated data nodes with instance type ​m5.large.elasticsearch​, with a 50GB EBS type ​gp2​ each.

  • Any setting not listed above uses default AWS settings.

Security groups

RDS

The RDS security group allows access to port 5432/tcp from the CIDR blocks of both VPCs (main and secondary). If the DR scenario with multiple TSB management planes is not being considered, it can be restricted to the CIDR block of the VPC where the RDS instance is deployed.
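For reference, opening 5432/tcp to the peer VPC's CIDR block with the AWS CLI might look like this (hypothetical security group ID and CIDR):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5432 \
  --cidr 10.1.0.0/16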

EKS

Standard EKS cluster security groups.

Elasticsearch

Allow 443/tcp from the CIDR blocks of both VPCs (main and secondary). If the DR scenario with multiple TSB management planes is not being considered, it can be restricted to the CIDR block of the VPC where the Elasticsearch domain is deployed.

Kubernetes YAMLs

Create repository in Elasticsearch

apiVersion: batch/v1
kind: Job
metadata:
  name: create-elasticsearch-repo
  namespace: tcc
  labels:
    app: create-elasticsearch-repo
spec:
  template:
    metadata:
      labels:
        app: create-elasticsearch-repo
    spec:
      containers:
      - env:
        ## include https:// and trailing /
        - name: AWS_ES_ENDPOINT
          value: [endpoint for VPC Elasticsearch]
        - name: AWS_REGION
          value: [region of the Elasticsearch domain]
        - name: REPO_NAME
          value: s3_repo_job
        - name: S3_BUCKET_NAME
          value: [destination bucket name]
        - name: S3_ACCESS_ROLE_ARN
          value: [IAM role to provide access to S3 bucket]
        command:
        - /bin/sh
        - -c
        - |
          cat <<EOF > /tmp/create_repo.py
          import boto3
          import os
          import requests
          from requests_aws4auth import AWS4Auth

          host = os.getenv('AWS_ES_ENDPOINT')
          region = os.getenv('AWS_REGION')
          service = 'es'
          credentials = boto3.Session().get_credentials()
          awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

          # Register repository
          repo_name = os.getenv('REPO_NAME')
          path = '_snapshot/{}'.format(repo_name)  # the Elasticsearch API endpoint
          url = host + path

          payload = {
              "type": "s3",
              "settings": {
                  "bucket": os.getenv('S3_BUCKET_NAME'),
                  # "endpoint": "us-east-1", # for us-east-1
                  # "region": "us-west-1", # for all other regions
                  "role_arn": os.getenv('S3_ACCESS_ROLE_ARN')
              }
          }

          headers = {"Content-Type": "application/json"}

          r = requests.put(url, auth=awsauth, json=payload, headers=headers)

          print(r.status_code)
          print(r.text)
          EOF
          pip install requests_aws4auth boto3 requests
          python /tmp/create_repo.py
        image: banst/awscli:1.18.30
        imagePullPolicy: IfNotPresent
        name: awscli
      restartPolicy: Never
Create snapshot

apiVersion: v1
kind: ConfigMap
metadata:
  name: snapshot-config-yaml
  namespace: tcc
data:
  action.yaml: |
    actions:
      1:
        action: snapshot
        description: >-
          Snapshot indices in main region to S3
        options:
          repository: [ES repository name]
          # Leaving name blank will result in the default 'curator-%Y%m%d%H%M%S'
          name:
          wait_for_completion: True
          max_wait: 3600
          wait_interval: 10
        filters:
        - filtertype: pattern
          kind: regex
          value: ".*$"
  curator.yaml: |
    client:
      hosts:
      - [only the host part of the endpoint for VPC Elasticsearch]
      port: 443
      url_prefix:
      use_ssl: True
      ssl_no_validate: True
      timeout: 30
      master_only: False
    logging:
      loglevel: INFO
      logfile:
      logformat: default
      blacklist: ['elasticsearch', 'urllib3']
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: elasticsearch-snapshot
  namespace: tcc
  labels:
    app: elasticsearch-snapshot
spec:
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 60
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: elasticsearch-snapshot
        spec:
          containers:
          - args:
            - --config
            - /etc/curator/curator.yaml
            - /etc/curator/action.yaml
            image: [docker registry]/curator:5.7.6
            imagePullPolicy: IfNotPresent
            name: curator
            volumeMounts:
            - mountPath: /etc/curator/
              name: snapshot-config-yaml
          restartPolicy: OnFailure
          volumes:
          - configMap:
              defaultMode: 420
              name: snapshot-config-yaml
            name: snapshot-config-yaml
  schedule: "0 * * * *"

Restore snapshot

apiVersion: v1
kind: ConfigMap
metadata:
  name: restore-config-yaml
  namespace: tcc
data:
  action.yaml: |
    actions:
      1:
        action: delete_indices
        description: Delete indices before restore
        options:
          ignore_empty_list: True
        filters:
        - filtertype: pattern
          kind: regex
          value: ".*$"
      2:
        action: restore
        description: >-
          Restore all indices from the most recent snapshot to secondary region
        options:
          repository: [ES repository name]
          # If name is blank, the most recent snapshot by age will be selected
          name:
          # If indices is blank, all indices in the snapshot will be restored
          indices:
          wait_for_completion: True
          max_wait: 3600
          wait_interval: 10
        filters:
        - filtertype: pattern
          kind: regex
          value: ".*$"
  curator.yaml: |
    client:
      hosts:
      - [only the host part of the endpoint for VPC Elasticsearch]
      port: 443
      url_prefix:
      use_ssl: True
      ssl_no_validate: True
      timeout: 30
      master_only: False
    logging:
      loglevel: INFO
      logfile:
      logformat: default
      blacklist: ['elasticsearch', 'urllib3']
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: elasticsearch-restore
  namespace: tcc
  labels:
    app: elasticsearch-restore
spec:
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 60
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: elasticsearch-restore
        spec:
          containers:
          - args:
            - --config
            - /etc/curator/curator.yaml
            - /etc/curator/action.yaml
            image: [docker registry]/curator:5.7.6
            imagePullPolicy: IfNotPresent
            name: curator
            volumeMounts:
            - mountPath: /etc/curator/
              name: restore-config-yaml
          restartPolicy: OnFailure
          volumes:
          - configMap:
              defaultMode: 420
              name: restore-config-yaml
            name: restore-config-yaml
  schedule: "0 * * * *"