Tetrate Service Bridge Version: 1.6.x

Distributed Tracing Integration

Basic Distributed Tracing knowledge required

This document assumes the reader has basic knowledge of Distributed Tracing concepts and terminology. If you are not familiar with Distributed Tracing, read up on it before following this document and adjusting the Distributed Tracing configuration of TSB. For a good introduction to Distributed Tracing concepts, read this excellent blog by Nic Munroe.

Required Service behavior

Distributed Tracing does not work out of the box: your deployed services must propagate trace context. Without context propagation in your services, traces will be broken and far less valuable. At a minimum, we suggest propagating the B3 and W3C trace context headers as well as the x-request-id header for request correlation. See also the trace context propagation explanation in the Istio documentation. In addition to context propagation, it is a very good idea to include the x-request-id (and possibly the trace id from distributed tracing) in all request-bound log lines in your services. This enables near effortless correlation between request traces and service logs and speeds up troubleshooting tremendously.

By default, TSB provides a SkyWalking powered, Zipkin compatible distributed tracing backend. All Envoy ingress gateways and sidecars, under TSB’s control, have their internal Zipkin tracing instrumentation set to send span data straight to TSB’s SkyWalking collectors. A fixed global sampling rate can also be configured through TSB’s ControlPlane resource object.

If you need more flexibility, such as more granular sampling rates, different tracing instrumentation, or sending span data to different backends, this document provides the context needed to make the required changes.

Istio Telemetry API

The Istio Telemetry API provides granular and flexible methods to adjust observability signals at runtime through the use of scoped Telemetry objects. Prior to the Istio Telemetry API it was required to adjust the TSB control plane and data plane operator configuration objects to configure a single distributed tracer with a fixed sampling rate.

After enabling tracing extension providers for the Istio Telemetry API through the TSB control plane operator configuration object, it is possible to set specific tracers with different sampling rates for different namespaces using the Istio Telemetry objects.
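As a minimal sketch of what this could look like, the two Telemetry objects below select different tracers and sampling rates for two namespaces. The namespace names (checkout, batch-jobs) and object names are hypothetical, and the provider names assume that matching tracing configurations called tsb-b3 and jaeger-b3 have been registered as extension providers, as described later in this document. Keep in mind that only cluster level tracing configuration has been validated, so treat namespace scoped objects like these as something to test in your own environment.

```yaml
# Hypothetical namespaces and object names; the provider names must match
# entries registered under spec.meshConfig.extensionProviders.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-checkout
  namespace: checkout
spec:
  tracing:
    - providers:
        - name: "tsb-b3"
      randomSamplingPercentage: 100.0 # trace every request in this namespace
---
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-batch-jobs
  namespace: batch-jobs
spec:
  tracing:
    - providers:
        - name: "jaeger-b3"
      randomSamplingPercentage: 1.0 # sample roughly 1% of requests
```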

Telemetry API feature status

While the Telemetry API has been around since Istio 1.12, it is still marked as alpha. This is mostly because its fine-grained configuration allows for many potential edge cases for tracing, metrics, and logging. We have tested and validated cluster level tracing configuration to be functional for Zipkin, OpenCensus, and OpenTelemetry tracing providers, but do not guarantee that configurations beyond this will work. Istio does not and will not provide native OpenTelemetry support without the use of the Telemetry API.

To learn more about the Istio Telemetry API, see the Istio Telemetry API documentation.

W3C Trace Context Propagation

By default TSB, through Envoy’s native Zipkin tracer instrumentation, uses the well known B3 trace context propagation method. B3 is one of the best supported propagation methods available for a variety of Distributed Tracing ecosystems, as it has been the de facto standard for many site owners who have adopted Distributed Tracing early on (e.g. Netflix).

Why the name B3?

The Zipkin ecosystem originated at Twitter, where most services had names of birds. Zipkin’s backend internal project name was Big Brother Bird. When the Zipkin ecosystem was open sourced, the B3 trace context headers were kept as is.

During a Distributed Tracing workshop in 2019, hosted by the Zipkin open source community, a few engineers from different organizations were invited and came together to figure out a new context propagation method that would allow tracing systems to interoperate even though some of them had different (optional) metadata requirements (most notably projects from Microsoft, AWS, and Dynatrace).

The idea was to have a common base context understood by all systems in one header (traceparent). Another header (tracestate) can contain multiple chunks of metadata from different tracing vendors. If a specific chunk of metadata is understood, it can be interacted with. Tracers can add their own metadata but are required to propagate the other tracing vendor metadata up to the maximum size of the header value. Metadata chunks then get purged in a FIFO manner.

To give the newly proposed solution more weight, the effort was taken to the W3C, hence the name W3C Trace Context for this propagation format. When OpenTelemetry came into existence through the CNCF-dictated merging of Google’s OpenCensus project (back then using B3 for propagation) and the vendor-consortium-backed OpenTracing (which gave no guarantees at all on trace context propagation; each vendor used their own), it was decided to switch the default from B3 to W3C Trace Context. However, most OpenTelemetry instrumentation supports B3 after a small configuration change.

Switching from B3 context propagation to W3C Trace Context in a TSB environment can be accomplished by changing the active Envoy tracing implementation. For a TSB 1.6 cluster, the only choice is OpenCensus. We advise against relying on this tracer going forward, as OpenCensus has been deprecated and is no longer maintained. The tracer will be removed in a future version of Envoy Proxy and, very likely, from the OpenTelemetry collector as well. When upgrading to TSB 1.7 or later, switch to the OpenTelemetry tracer.

OpenTelemetry Collector for Tracing

The OpenTelemetry collector is a “swiss army knife” of span data management. It can receive span data in different formats from various tracing instrumentations and export this data to multiple backends, potentially using different span data formats. In this document we will show how an OpenTelemetry Collector can receive span data from Zipkin, OpenCensus, and OpenTelemetry tracing instrumentation and export it to an OpenTelemetry compatible backend as well as to TSB’s embedded tracing backend.

Enabling the Telemetry API for tracing in TSB

To make any change to TSB’s tracing configuration as described in this document, you first need to enable tracing extension providers for the Istio Telemetry API in TSB. For this you are required to adjust the TSB ControlPlane resource object for each cluster in your environment.

The TSB operator, using its ControlPlane resource object, manages the configuration and deployment of its Istio dependency. When a TSB ControlPlane object is applied, the TSB operator will create an IstioOperator resource object. This resulting object is then used to (re)configure the Istio deployment. To enable the Telemetry API for tracing, we need to patch this generated IstioOperator resource object through the TSB ControlPlane object using an overlay.

TSB ControlPlane resource object overlay

To make sure we don’t overwrite important custom configurations found in the ControlPlane object, we download the current state first. The following steps need to be repeated for each cluster you want to adjust.

Fetch the ControlPlane resource object by running:

kubectl get -n istio-system controlplane controlplane \
  -o yaml > controlplane.yaml

Cluster name

Take note of the value of clusterName as found in the managementPlane section. You will need this value later when configuring the Istio Telemetry object.

Edit the ControlPlane object by adding a patch for the IstioOperator object like this:

apiVersion: install.tetrate.io/v1alpha1
kind: ControlPlane
metadata:
  ...
spec:
  components:
    ...
    istio:
      kubeSpec:
        # start of overlay
        overlays:
          - apiVersion: install.istio.io/v1alpha1
            kind: IstioOperator
            name: tsb-istiocontrolplane
            patches:
              - path: spec.meshConfig.extensionProviders
                value:
                # You can list multiple trace configurations here!
                # They can be different tracers as well as different configurations
                # for the same tracing instrumentation.
                  - name: <tracing-config-name>
                    <extensionProvider>: # e.g. zipkin or opencensus
                      service: <ip_or_host>
                      port: <port_number>
              # optional default extension provider patch; not required
              # warning: this will inject trace headers even if tracing is disabled
              # for a particular namespace. Make sure this is a desired side effect.
              - path: spec.meshConfig.defaultProviders.tracing
                value:
                  tracing:
                    # Even though this is a list, only one default tracer is supported!
                    - <tracing-config-name>
        # end of overlay
        deployment:
          ...

To install the adjusted ControlPlane resource object:

kubectl apply -f controlplane.yaml

The required part of the patch is the one that provides one or more tracing configurations under spec.meshConfig.extensionProviders. Setting a patch for spec.meshConfig.defaultProviders.tracing has the side effect that all request traffic will be augmented with the trace headers of the default tracing configuration, even if your Telemetry API configuration does not explicitly set a tracing configuration for the incoming request. Our advice is to not set the default as a patch but to rely on your Telemetry API resource objects instead, unless you use the trace id for request correlation in logs even when distributed tracing is disabled.

Here is an example of a flexible set-up configuration with multiple trace configurations, allowing the use of both B3 as well as W3C trace-context tracers with the ability to send the data to TSB, an external Jaeger tracing backend, or both.

Multiple Tracing Backends

Note that it is possible to add the same tracer type multiple times but each with different endpoint configurations. This can be very handy to designate one tracing backend as specific for troubleshooting purposes or provide app teams their own setups.

Native Zipkin support for the Jaeger backend

Jaeger’s Zipkin support can be activated through the following command line argument --collector.zipkin.host-port=:9411. In the example below this is required as it enables the “jaeger-b3” tracing configuration to send data straight to Jaeger without an OpenTelemetry collector in between.

            patches:
              - path: spec.meshConfig.extensionProviders
                value:
                  - name: tsb-b3 # Zipkin tracer to TSB backend
                    zipkin:
                      service: "zipkin.istio-system.svc.cluster.local"
                      port: 9411
                  - name: jaeger-b3 # Zipkin tracer to Jaeger backend
                    zipkin:
                      service: "jaeger-collector.default.svc.cluster.local"
                      port: 9411
                  - name: both-b3 # Zipkin tracer to OTel collector
                    zipkin:
                      service: "otel-collector.default.svc.cluster.local"
                      port: 9411
                  - name: both-w3c # OpenCensus W3C tracer to OTel collector
                    opencensus:
                      service: "otel-collector.default.svc.cluster.local"
                      port: 55678
                      context:
                        - W3C_TRACE_CONTEXT
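For TSB 1.7 and later, where the switch to the OpenTelemetry tracer is advised, the OpenCensus entry above could be replaced by an entry along the following lines. This is a sketch rather than a validated configuration: it assumes Istio's opentelemetry extension provider type is available in your Istio version, that the collector exposes its OTLP gRPC receiver on port 4317 (as in the demo collector configuration further down), and the name both-w3c-otel is hypothetical.

```yaml
# Fragment of the spec.meshConfig.extensionProviders patch value.
# Assumption: the cluster's Istio version supports the "opentelemetry" provider type.
- name: both-w3c-otel # hypothetical name; OpenTelemetry tracer to OTel collector
  opentelemetry:
    service: "otel-collector.default.svc.cluster.local"
    port: 4317 # OTLP gRPC receiver port of the collector
```

The OpenTelemetry tracer propagates W3C Trace Context by default, so no separate context setting is shown here.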

Check TSB Operator logs

It is easy to make mistakes when dealing with overlays, and edited resource objects are typically not rejected as long as the YAML is syntactically correct. It is therefore a good idea to tail the logs of the TSB operator while applying the resource object. If you see an apply error, you have most likely made a typo in the patch or used incorrect indentation in the patch value.

To tail TSB operator logs to inspect if your overlay is successfully processed, run the following command:

kubectl logs -n istio-system -l name=tsb-operator -f

Setting up an OpenTelemetry collector for tracing

The extension provider example above assumes that the OpenTelemetry collector forwards the span data it receives to an OTLP compatible Jaeger backend as well as to TSB’s SkyWalking collector, which expects Zipkin data. The default OpenTelemetry collector distribution supports OTLP, OpenCensus, and Zipkin receivers and exporters.

Vendor OpenTelemetry distributions

If you use a vendor-specific OpenTelemetry collector (e.g. the Splunk OpenTelemetry distribution), it is common to have wide support for receivers but very limited support for exporters (often just OTLP and native vendor exporters). In those cases you will need to create your own OpenTelemetry distribution that supports your vendor’s exporters as well as the Zipkin exporter, if the span data must be fed back to TSB. If the OTLP exporter is available, a daisy-chained OpenTelemetry collector solution is possible, although inefficient.

To set up an OpenTelemetry collector for the use case presented here, the helm chart values object could look like this:

Demo configuration

This is not a production ready OpenTelemetry configuration.

mode: deployment
fullnameOverride: otel-collector # sets the OTel service name; the default is very verbose
replicaCount: 1
config:
  extensions:
    health_check: {}
  processors:
    batch: {}
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318
    zipkin:
      endpoint: ${env:MY_POD_IP}:9411
    opencensus: # Only required for TSB 1.6 W3C use case, remove for TSB 1.7+
      endpoint: ${env:MY_POD_IP}:55678
  exporters:
    zipkin/tsb:
      endpoint: http://zipkin.istio-system.svc:9411/api/v2/spans
    otlp/jaeger:
      endpoint: jaeger-collector.default.svc:4317
      tls:
        insecure: true
  service:
    extensions:
      - health_check
    pipelines:
      traces:
        receivers:
          - otlp
          - zipkin
          - opencensus # Only required for TSB 1.6 W3C use case, remove for TSB 1.7+
        processors:
          - batch
        exporters:
          - zipkin/tsb
          - otlp/jaeger
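For the daisy-chained variant mentioned for vendor distributions, the vendor collector could forward spans over OTLP to a second collector that owns the Zipkin exporter towards TSB. The fragment below is only a sketch of the vendor collector's side, written as raw collector configuration rather than helm values. The exporter name otlp/tsb-relay is hypothetical, and the endpoint assumes the demo collector above is reachable as otel-collector in the default namespace; your vendor distribution may wrap this configuration differently.

```yaml
# Vendor collector configuration fragment (assumed layout).
exporters:
  otlp/tsb-relay: # hypothetical exporter name
    endpoint: otel-collector.default.svc:4317 # OTLP gRPC receiver of the relay collector
    tls:
      insecure: true # demo only, no TLS between the collectors
service:
  pipelines:
    traces:
      exporters:
        - otlp/tsb-relay
        # keep your vendor's own exporter(s) listed here as well
```

In such a set-up the relay collector would typically keep only the zipkin/tsb exporter in its traces pipeline, since the vendor collector already takes care of the other backend.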

Installation of OpenTelemetry Collector through helm looks like this:

helm install otel-trace open-telemetry/opentelemetry-collector \
  --values otel-collector-values.yaml

Setting up a Jaeger backend

In this example we assume the availability of a Jaeger backend. Here is a demo configuration to deploy Jaeger using the Jaeger operator.

Install the Jaeger operator:

kubectl create namespace observability
kubectl create -n observability -f \
    https://github.com/jaegertracing/jaeger-operator/releases/download/v1.49.0/jaeger-operator.yaml

Demo configuration

This is not a production ready Jaeger configuration.

After the Jaeger operator has been installed successfully, you can create the required Jaeger deployment configuration. For this example we’ll use the demo all-in-one image with in-memory storage and enable Jaeger’s Zipkin collector.

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: default
spec:
  strategy: allInOne
  allInOne:
    options:
      collector:
        zipkin:
          host-port: ":9411"

Apply the Jaeger object to configure and deploy the Jaeger All-in-one solution:

kubectl apply -f jaeger.yaml

After applying the object, you should see the jaeger instance being deployed in the default namespace:

kubectl get pods -l app.kubernetes.io/instance=jaeger

By default, the Jaeger operator creates an ingress route for you to access the UI. You can retrieve the address information by executing the following command:

kubectl get ingress

Unprotected UI access

The default ingress creation behavior of Jaeger is quite insecure. If you are uncomfortable with this behavior please adjust the Jaeger configuration. For more information on Jaeger configuration, see the Jaeger operator documentation.
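One possible adjustment, shown here as a sketch, is to disable the automatic ingress creation in the Jaeger object used above. It assumes the spec.ingress.enabled field of the Jaeger operator's custom resource; check the Jaeger operator documentation for the version you deploy.

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: default
spec:
  strategy: allInOne
  ingress:
    enabled: false # do not expose the UI through an automatically created ingress
  allInOne:
    options:
      collector:
        zipkin:
          host-port: ":9411"
```

With the ingress disabled, the UI would need to be reached some other way, for example through a port-forward to the Jaeger query service or through a gateway that enforces authentication.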

Using the Telemetry API

By enabling the extension providers we have muted the traditional Zipkin instrumentation. TSB will not trace requests without specifying the required behavior through the Istio Telemetry API.

To enable a mesh-wide default using the “both-b3” tracing configuration you’ve created earlier, you can provide a new global Telemetry object like the one below:

Only one global telemetry object allowed

The Istio Telemetry API specification dictates that only one globally scoped Telemetry object can be applied to the root namespace istio-system. Upgrading TSB beyond v1.6 automatically creates a global Telemetry object (called xcp-mesh-default) as part of the upgrade process, which handles improved global metrics configuration for the newer Istio deployment it uses. If you have created a custom global Telemetry object for tracing in TSB v1.6, you will need to remove it before upgrading TSB and merge its tracing configuration into the xcp-mesh-default object after the upgrade.

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: "both-b3" # use one of the extension provider tracing configurations here
    customTags:
      cluster:
        literal:
          value: "app-cluster-1" # use the TSB clusterName here!
    randomSamplingPercentage: 100.0 # use the desired sampling rate here

By switching the tracing provider configuration name you can switch between B3 and W3C context propagation, and between sending span data straight to TSB, straight to Jaeger, or to the OpenTelemetry collector that feeds both TSB and Jaeger.
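As an illustration of more granular use, the sketch below overrides the mesh-wide default for one namespace with the W3C tracer at a lower sampling rate, and turns off span reporting for a single workload in that namespace. The namespace team-a, the app: load-generator label, and the object names are hypothetical. Scoping below cluster level has not been validated for TSB, so verify the behavior before relying on it.

```yaml
# Namespace-scoped override: W3C propagation via the OTel collector, 10% sampling.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-w3c # hypothetical
  namespace: team-a # hypothetical
spec:
  tracing:
    - providers:
        - name: "both-w3c"
      customTags:
        cluster:
          literal:
            value: "app-cluster-1" # use the TSB clusterName here
      randomSamplingPercentage: 10.0
---
# Workload-scoped override: stop reporting spans for one noisy workload.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-off-load-generator # hypothetical
  namespace: team-a
spec:
  selector:
    matchLabels:
      app: load-generator # hypothetical workload label
  tracing:
    - disableSpanReporting: true
```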

For more information and examples see the Istio Telemetry API documentation.
