Distributed Tracing Integration
This document assumes the reader has basic knowledge of Distributed Tracing concepts and terminology. If you are not familiar with Distributed Tracing, it is advised to read up on it before following this document and adjusting the Distributed Tracing configuration of TSB. For a good introduction to Distributed Tracing concepts, read this excellent blog by Nic Munroe.
Distributed Tracing does not work out of the box, as it is important for your deployed services to propagate trace context. Without enabling context propagation in your services, you will experience broken traces and see a highly diminished value in tracing. We suggest supporting, at a minimum, propagation of the B3 and W3C trace context headers as well as the x-request-id header for request correlation. See also the trace context propagation explanation in the Istio documentation. In addition to context propagation, it is a very good idea to include the x-request-id (and possibly the trace id from distributed tracing) in all request-bound log lines in your services. This enables near effortless correlation between request traces and service logs and speeds up troubleshooting tremendously.
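As an illustration, these are the kind of request headers a service would copy from each incoming request onto every outgoing call it makes (the values below are examples only, not real trace data):
x-request-id: 3e2b7f0a-98b1-4c6e-9d2a-1f4b5e6c7a8d
x-b3-traceid: 80f198ee56343ba864fe8b2a57d3eff7
x-b3-spanid: e457b5a2e4d86bd1
x-b3-parentspanid: 05e3ac9a4f6e3b90
x-b3-sampled: 1
traceparent: 00-80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-01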
By default, TSB provides a SkyWalking-powered, Zipkin-compatible distributed tracing backend. All Envoy ingress gateways and sidecars under TSB’s control have their internal Zipkin tracing instrumentation configured to send span data straight to TSB’s SkyWalking collectors. A fixed global sampling rate can also be configured through TSB’s ControlPlane resource object.
If you need more flexibility, such as more granular sampling rates, different tracing instrumentation, or sending span data to different backends, this document provides you with the context needed to make the required changes.
Istio Telemetry API
The Istio Telemetry API provides granular and flexible methods to adjust
observability signals at runtime through the use of scoped Telemetry
objects.
Prior to the Istio Telemetry API it was required to adjust the TSB control plane
and data plane operator configuration objects to configure a single distributed
tracer with a fixed sampling rate.
After enabling tracing extension providers for the Istio Telemetry API through the TSB control plane operator configuration object, it is possible to set specific tracers with different sampling rates for different namespaces using the Istio Telemetry objects.
While the Telemetry API has been around since Istio 1.12, it is still marked as being in alpha state. This is mostly due to the fine-grained configuration options it allows, which bring many potential edge cases for tracing, metrics, and logging. We have tested and validated cluster-level tracing configuration to be functional for the Zipkin, OpenCensus, and OpenTelemetry tracing providers, but do not guarantee successful configuration use cases beyond this. Istio does not and will not provide native OpenTelemetry support without the use of the Telemetry API.
To learn more about the Istio Telemetry API, see the Istio Telemetry API documentation.
W3C Trace Context Propagation
By default TSB, through Envoy’s native Zipkin tracer instrumentation, uses the well-known B3 trace context propagation method. B3 is one of the best supported propagation methods available for a variety of Distributed Tracing ecosystems, as it has been the de facto standard for many site owners who adopted Distributed Tracing early on (e.g. Netflix).
The Zipkin ecosystem originated at Twitter, where most services had names of birds. Zipkin’s backend internal project name was Big Brother Bird. When the Zipkin ecosystem was open sourced, the B3 trace context headers were kept as is.
During a Distributed Tracing workshop in 2019, hosted by the Zipkin open source community, a few engineers from different organizations were invited and came together to figure out a new context propagation method that would allow tracing systems to interoperate even though some of them had different (optional) metadata requirements (most notably projects from Microsoft, AWS, and Dynatrace).
The idea was to have a common base context understood by all systems in one header (traceparent). Another header (tracestate) can contain multiple chunks of metadata from different tracing vendors. If a specific chunk of metadata is understood, it can be interacted with. Tracers can add their own metadata but are required to propagate the other tracing vendors’ metadata up to the maximum size of the header value. Metadata chunks then get purged in a FIFO manner.
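Concretely, the two headers look like this (illustrative values; the traceparent follows the version-traceid-spanid-flags layout and the tracestate entries are vendor specific):
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7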
To give more weight to the newly proposed solution, the effort was taken to the W3C, hence this propagation format being called W3C Trace Context. When OpenTelemetry came into existence through the CNCF-dictated merger of Google’s OpenCensus project (back then using B3 for propagation) and the vendor consortium backed OpenTracing (which had no guarantees at all on trace context propagation; each vendor used their own), it was decided to switch the default from B3 to W3C Trace Context. However, most OpenTelemetry instrumentation supports B3 after a small configuration change.
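For example, most OpenTelemetry SDKs let you select the propagation formats through the standard OTEL_PROPAGATORS environment variable (exact support varies per SDK and version):
export OTEL_PROPAGATORS="b3multi,tracecontext,baggage"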
Switching from B3 context propagation to W3C Trace Context in a TSB environment can be accomplished by changing the active Envoy tracing implementation. For a TSB 1.6 cluster, the only choice is OpenCensus. It is advised not to use this tracer going forward, as OpenCensus has been deprecated and is no longer maintained. The tracer will be removed in a future version of Envoy Proxy and highly likely also from the OpenTelemetry collector. When upgrading to TSB 1.7 and up, it is advised to switch to the OpenTelemetry tracer.
OpenTelemetry Collector for Tracing
The OpenTelemetry collector is a “swiss army knife” of span data management. It can receive span data in different formats from various tracing instrumentations and export this data to multiple backends, potentially using different span data formats. In this document we will show how an OpenTelemetry Collector can be used to receive span data from Zipkin, OpenCensus, and OpenTelemetry tracing instrumentation and export it to an OpenTelemetry-compatible backend as well as to TSB’s embedded tracing backend.
Enabling the Telemetry API for tracing in TSB
To make any change to TSB’s tracing configuration as described in this document, you first need to enable tracing extension providers for the Istio Telemetry API in TSB. To do so, you must adjust the TSB ControlPlane resource object for each cluster in your environment.
The TSB operator, using its ControlPlane resource object, manages the configuration and deployment of its Istio dependency. When a TSB ControlPlane object is applied, the TSB operator will create an IstioOperator resource object. This resulting object is then used to (re)configure the Istio deployment. To enable the Telemetry API for tracing, we need to patch this generated IstioOperator resource object through the TSB ControlPlane object using an overlay.
TSB ControlPlane resource object overlay
To make sure we don’t overwrite important custom configurations found in the ControlPlane object, we download the current state first. The following steps need to be repeated for each cluster you want to adjust.
Fetch the ControlPlane resource object by running:
kubectl get -n istio-system controlplane controlplane \
-o yaml > controlplane.yaml
Take note of the value of the clusterName as found in the managementPlane section. You will need this value later when configuring the Istio Telemetry object.
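For reference, the value lives in a section similar to the following excerpt (the host, port, and cluster name shown here are placeholders; yours will differ):
spec:
  managementPlane:
    host: tsb.example.com
    port: 8443
    clusterName: app-cluster-1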
Edit the ControlPlane object by adding a patch for the IstioOperator object like this:
apiVersion: install.tetrate.io/v1alpha1
kind: ControlPlane
metadata:
  ...
spec:
  components:
    ...
    istio:
      kubeSpec:
        # start of overlay
        overlays:
          - apiVersion: install.istio.io/v1alpha1
            kind: IstioOperator
            name: tsb-istiocontrolplane
            patches:
              - path: spec.meshConfig.extensionProviders
                value:
                  # You can list multiple trace configurations here!
                  # They can be different tracers as well as different configurations
                  # for the same tracing instrumentation.
                  - name: <tracing-config-name>
                    <extensionProvider>:
                      service: <ip_or_host>
                      port: <port_number>
              # optional default extension provider patch; not required
              # warning: this will inject trace headers even if tracing is disabled
              # for a particular namespace. Make sure this is a desired side effect.
              - path: spec.meshConfig.defaultProviders.tracing
                value:
                  tracing:
                    # Even though this is a list, only one default tracer is supported!
                    - <tracing-config-name>
        # end of overlay
        deployment:
          ...
To install the adjusted ControlPlane resource object:
kubectl apply -f controlplane.yaml
The required part of the patch is to provide one or multiple tracing configurations under spec.meshConfig.extensionProviders. Setting a patch for spec.meshConfig.defaultProviders.tracing has the side effect that all request traffic will be augmented with the trace headers of the default trace config instrumentation, even if your Telemetry API configuration does not explicitly set a tracing config for the incoming request. Our advice is to not set the default as a patch, but rather rely on your Telemetry API resource objects, unless you use the trace id for request correlation in logs even when distributed tracing is disabled.
Here is an example of a flexible set-up with multiple trace configurations, allowing the use of both B3 and W3C trace-context tracers with the ability to send the data to TSB, an external Jaeger tracing backend, or both.
Note that it is possible to add the same tracer type multiple times, each with a different endpoint configuration. This can be very handy to designate one tracing backend specifically for troubleshooting purposes, or to provide app teams with their own setups.
Jaeger’s Zipkin support can be activated through the command line argument --collector.zipkin.host-port=:9411. In the example below this is required, as it enables the “jaeger-b3” tracing configuration to send data straight to Jaeger without an OpenTelemetry collector in between.
Jaeger versions v1.35 and up have native support for OpenTelemetry’s OTLP transport; older versions need an OpenTelemetry collector in between. In the example below OTLP support is assumed, as it enables the “jaeger-w3c” tracing configuration to send data straight to Jaeger without an OpenTelemetry collector in between.
patches:
  - path: spec.meshConfig.extensionProviders
    value:
      - name: tsb-b3 # Zipkin tracer to TSB backend
        zipkin:
          service: "zipkin.istio-system.svc.cluster.local"
          port: 9411
      - name: jaeger-b3 # Zipkin tracer to Jaeger backend
        zipkin:
          service: "jaeger-collector.default.svc.cluster.local"
          port: 9411
      - name: jaeger-w3c # OTel tracer to Jaeger backend
        opentelemetry:
          service: "jaeger-collector.default.svc.cluster.local"
          port: 4317
      - name: both-b3 # Zipkin tracer to OTel collector
        zipkin:
          service: "otel-collector.default.svc.cluster.local"
          port: 9411
      - name: both-w3c # OTel tracer to OTel collector
        opentelemetry:
          service: "otel-collector.default.svc.cluster.local"
          port: 4317
Since it is easy to make mistakes when dealing with overlays, and edited resource objects are typically not rejected as long as the YAML is syntactically correct, it is a good idea to tail the logs of the TSB operator while applying the resource object. If you see an apply error, you most likely made a typo in the patch or used incorrect indentation in the patch value.
To tail TSB operator logs to inspect if your overlay is successfully processed, run the following command:
kubectl logs -n istio-system -l name=tsb-operator -f
Setting up an OpenTelemetry collector for tracing
In the above example of extension provider configurations, it is assumed that sending span data to an OpenTelemetry collector results in this collector forwarding the data to an OTLP-compatible Jaeger backend as well as to TSB’s SkyWalking collector, which expects Zipkin data. The stock OpenTelemetry collector supports OTLP and Zipkin receivers and exporters out of the box.
If you use a vendor-specific OpenTelemetry collector (e.g. the Splunk OpenTelemetry distribution), it is common to have wide support for receivers but very limited support for exporters (often just OTLP and the native vendor exporters). In those cases you will need to create your own OpenTelemetry distribution to support your vendor’s exporters as well as the Zipkin exporter, if feeding the span data back to TSB is required. If an OTLP exporter is available, a daisy-chained OpenTelemetry collector solution is possible, although inefficient.
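As a rough sketch of such a daisy-chained setup (the service name is hypothetical), the vendor distribution would add an OTLP exporter pointing at a stock collector, which in turn performs the Zipkin export to TSB as shown in the next section:
exporters:
  otlp/daisychain:
    # hypothetical address of a stock OpenTelemetry collector that exports Zipkin to TSB
    endpoint: otel-collector.default.svc:4317
    tls:
      insecure: true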
To set up an OpenTelemetry collector for the use case presented here, the Helm chart values file could look like this:
This is not a production ready OpenTelemetry configuration.
mode: deployment
fullnameOverride: otel-collector # this sets the otel service name, the default is very verbose
replicaCount: 1
config:
  extensions:
    health_check: {}
  processors:
    batch: {}
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318
    zipkin:
      endpoint: ${env:MY_POD_IP}:9411
  exporters:
    zipkin/tsb:
      endpoint: http://zipkin.istio-system.svc:9411/api/v2/spans
    otlp/jaeger:
      endpoint: jaeger-collector.default.svc:4317
      tls:
        insecure: true
  service:
    extensions:
      - health_check
    pipelines:
      traces:
        receivers:
          - otlp
          - zipkin
        processors:
          - batch
        exporters:
          - zipkin/tsb
          - otlp/jaeger
Installation of the OpenTelemetry Collector through Helm looks like this:
helm install otel-trace open-telemetry/opentelemetry-collector \
--values otel-collector-values.yaml
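The command above assumes the upstream open-telemetry chart repository has already been registered with Helm; if not, add it first:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update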
Setting up a Jaeger backend
In this example we assume the availability of a Jaeger backend. Here is a demo configuration to deploy Jaeger using the Jaeger operator.
Install the Jaeger operator:
kubectl create namespace observability
kubectl create -n observability -f \
https://github.com/jaegertracing/jaeger-operator/releases/download/v1.49.0/jaeger-operator.yaml
This is not a production ready Jaeger configuration.
After the Jaeger operator has been successfully installed, you can create the required Jaeger deployment configuration. For this example we’ll use the demo all-in-one image with in-memory storage and enable Jaeger’s Zipkin collector.
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: default
spec:
  strategy: allInOne
  allInOne:
    options:
      collector:
        zipkin:
          host-port: ":9411"
Apply the Jaeger object to configure and deploy the Jaeger All-in-one solution:
kubectl apply -f jaeger.yaml
After applying the object, you should see the jaeger instance being deployed in the default namespace:
kubectl get pods -l app.kubernetes.io/instance=jaeger
By default, the Jaeger operator creates an ingress route for you to access the UI. You can retrieve the address information by executing the following command:
kubectl get ingress
The default ingress creation behavior of the Jaeger operator is quite insecure. If you are uncomfortable with this behavior, please adjust the Jaeger configuration. For more information on Jaeger configuration, see the Jaeger operator documentation.
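For example, ingress creation can be switched off by adding the following to the spec of the jaeger.yaml shown earlier, leaving it to you to expose the UI in a way that fits your environment:
spec:
  ingress:
    enabled: false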
Using the Telemetry API
By enabling the extension providers we have muted the traditional Zipkin instrumentation. TSB will not trace requests until you specify the required behavior through the Istio Telemetry API.
The Istio Telemetry API specification dictates that only one globally scoped Telemetry object can be applied to the root namespace istio-system. Certain versions of TSB v1.7.x automatically created a global Telemetry object named xcp-mesh-default as part of the installation/upgrade process, where it handled improved global metrics configuration for the Istio deployment it uses. In TSB v1.8 and up, this global Telemetry object is no longer created or used. If you added your own global Telemetry configuration inside this object, you can continue to use it as is. It is also allowed to delete this object and create a new one with an object name and contents of your choosing.
To enable a mesh-wide default using the “both-b3” tracing configuration you’ve registered earlier, you can create a new global Telemetry object like the one below.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: "both-b3" # use one of the extension provider tracing configurations here
      customTags:
        cluster:
          literal:
            value: "app-cluster-1" # use the TSB clusterName here!
        tracer: # it is smart to add a tracer tag to highlight the used configuration
          literal:
            value: "both-b3"
      randomSamplingPercentage: 100.0 # use the desired sampling rate here
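Apply the Telemetry object to activate the mesh-wide tracing default: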
kubectl apply -f mesh-default.yaml
By switching the tracing provider configuration name you can switch between B3 and W3C context propagation as well as sending straight to TSB, straight to Jaeger, or the OpenTelemetry collector for feeding both TSB and Jaeger.
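Namespace-scoped Telemetry objects can override this mesh-wide default. As a sketch (the namespace name and sampling rate are illustrative), the following samples only 10% of requests for workloads in the team-a namespace and sends them straight to Jaeger using W3C trace context:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-override
  namespace: team-a
spec:
  tracing:
    - providers:
        - name: "jaeger-w3c"
      randomSamplingPercentage: 10.0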
For more information and examples see the Istio Telemetry API documentation.