Tetrate Service Bridge Version: 1.9.x

Metric

A metric is a measurement about a service, captured at runtime. Logically, the moment one of these measurements is captured is known as a metric event, which consists not only of the measurement itself but also of the time it was captured and its associated metadata.

The key aspects of a metric are the measure, the metric type, the metric origin, and the metric detect point:

The measure describes the type and unit of a metric event, also known as a measurement.
The metric type is the aggregation over time applied to the measurements.
The metric origin tells where the metric measurements come from.
The detect point is the point from which the metric is observed: in service, server side, or client side. It is useful to differentiate between metrics that observe a concrete service (often self-observing) and metrics that focus on service-to-service communications.

A TSB-controlled service (one that is part of the mesh and has a proxy we can configure) has several metrics available, which enables consistent monitoring across services. Some of them cover what is known as the RED metrics set, a set of very useful metrics for HTTP/RPC request-based services. RED stands for:

Rate (R): The number of requests per second.
Errors (E): The number of failed requests.
Duration (D): The amount of time to process a request.
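
To make the three RED definitions concrete, here is a minimal Python sketch that summarizes them over a window of requests. The `Request` record, its field names, and the `red_summary` helper are hypothetical and only illustrate the arithmetic; they are not part of the TSB API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp_s: float  # when the request was handled, in seconds
    duration_ms: float  # how long it took to process the request
    failed: bool        # whether the request ended in an error


def red_summary(requests, window_s):
    """Summarize Rate, Errors, and Duration for the requests seen in a window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.failed)
    avg_ms = sum(r.duration_ms for r in requests) / total if total else 0.0
    return {
        "rate_per_s": total / window_s,  # Rate: requests per second
        "errors": errors,                # Errors: number of failed requests
        "avg_duration_ms": avg_ms,       # Duration: average processing time
    }


sample = [
    Request(0.5, 12.0, False),
    Request(1.0, 30.0, True),
    Request(1.5, 18.0, False),
    Request(3.0, 20.0, False),
]
summary = red_summary(sample, window_s=4)
print(summary)
```

With four requests in a four second window, one of them failed, the sketch reports a rate of 1.0 request per second, 1 error, and an average duration of 20.0 ms.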

To understand which metrics are available for a concrete telemetry source, let's assume we have deployed the classic Istio bookinfo demo application. Let's look at some RED-based metrics available for a service observed and managed by TSB, for instance the reviews service, using the GLOBAL scoped telemetry source.

The following metric is the number of requests per minute that the reviews service is handling at a GLOBAL scope:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_cpm
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: REQUESTS
    unit: "{request}"
  metricType:
    type: CPM
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE

The metric for the average duration of the requests handled by the reviews service at a GLOBAL scope:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_resp_time
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: LATENCY
    unit: ms
  metricType:
    type: AVERAGE
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE

The metric for the errors of the requests handled by the reviews service at a GLOBAL scope. In this case the number of errors is expressed as a percentage of the total number of handled requests:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_sla
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: STATUS
    unit: NUMBER
  metricType:
    type: PERCENT
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE

Using a different telemetry source for the same metric gives a different view of the same observed measurements. For instance, if we want to know how many requests per minute subset v1 of the reviews service is handling, we need to use the same metric but from a different telemetry source, in this case reviews-v1:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews-v1
  name: service_cpm
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: REQUESTS
    unit: NUMBER
  metricType:
    type: CPM
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
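
The idea that a telemetry source only changes the view over the same measurements can be sketched in a few lines of Python. The observation tuples and the `calls_per_minute` helper below are hypothetical; the sketch assumes each handled request is tagged with the minute it occurred in and the subset that served it.

```python
from collections import Counter

# Hypothetical observations: (minute, subset) for each handled request.
observations = [
    (0, "v1"), (0, "v2"), (0, "v1"),
    (1, "v3"), (1, "v1"),
]


def calls_per_minute(observations, subset=None):
    """Count calls per minute, optionally restricted to a single subset."""
    counts = Counter()
    for minute, seen_subset in observations:
        if subset is None or seen_subset == subset:
            counts[minute] += 1
    return dict(counts)


# Same measurements, two views: all subsets versus subset v1 only.
global_cpm = calls_per_minute(observations)
v1_cpm = calls_per_minute(observations, subset="v1")
print(global_cpm, v1_cpm)
```

The unfiltered call plays the role of the reviews source, while the filtered call plays the role of reviews-v1: the aggregation is identical and only the set of measurements it covers differs.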

The duration or latency measurements can also be aggregated into different percentiles over time. The duration percentiles for the requests handled by the reviews service at a GLOBAL scope:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_percentile
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: LATENCY
    unit: ms
  metricType:
    type: PERCENTILE
    labels:
    - key: "0"
      value: "p50"
    - key: "1"
      value: "p75"
    - key: "2"
      value: "p90"
    - key: "3"
      value: "p95"
    - key: "4"
      value: "p99"
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
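
As an illustration of what aggregating latency over percentile labels means, the following Python sketch computes one value per label from a list of latency measurements. It uses the nearest-rank method, which is an assumption made for this example; the source does not state which percentile estimation TSB applies. The `nearest_rank` and `percentile_metric` names are hypothetical.

```python
def nearest_rank(sorted_values, pct):
    """Return the pct-th percentile of sorted values (nearest-rank method)."""
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def percentile_metric(latencies_ms, labels=("p50", "p75", "p90", "p95", "p99")):
    """Aggregate latencies into one value per percentile label."""
    ordered = sorted(latencies_ms)
    return {label: nearest_rank(ordered, int(label[1:])) for label in labels}


latencies = list(range(1, 101))  # 100 measurements: 1 ms to 100 ms
result = percentile_metric(latencies)
print(result)
```

Each label becomes one dimension of the resulting metric, so a single PERCENTILE metric yields a value for p50, p75, p90, p95, and p99 at every point in time.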

Measure

A measure represents the name and unit of a measurement. For example, request latency in ms and the number of errors are measures to collect from a server. In the first case, latency would be the type and ms (millisecond) the unit.

Field	Description	Validation Rule
name	string The name of the measure. For instance latency in ms. More reference values can be found at MeshControlledMeasureNames.	–
unit	string The unit of measure, which follows the Unified Code for Units of Measure (UCUM). COUNTABLE measures, such as the number of requests or network packets, SHOULD use the default unit, the unity, with annotations in curly braces to give additional meaning. For example {requests}, {packets}, {errors}, {faults}, etc.	–
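
The curly-brace convention for countable units can be sketched as follows. This is a simplified illustration, not a UCUM parser: the `split_unit` helper is hypothetical and assumes a unit string carries at most one annotation, with the unity written as `1` when only an annotation is present.

```python
import re

# Matches a single curly-brace annotation such as {requests}.
_ANNOTATION = re.compile(r"\{([^{}]*)\}")


def split_unit(unit):
    """Split a unit string into (base unit, annotation or None)."""
    match = _ANNOTATION.search(unit)
    annotation = match.group(1) if match else None
    # With the annotation removed, an empty remainder means the unity.
    base = _ANNOTATION.sub("", unit) or "1"
    return base, annotation


print(split_unit("{requests}"))
print(split_unit("ms"))
```

Here `{requests}` resolves to the unity with the annotation `requests`, while `ms` is a plain unit with no annotation.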

Metric

Application and request metrics are important indicators of availability and performance. Custom metrics can provide insights into how availability indicators impact user experience or the business. Collected data can be used to alert of an outage or trigger scheduling decisions to scale up a deployment automatically upon high demand.

Field	Description	Validation Rule
observedResource	string OUTPUT_ONLY Which concrete TSB resource in the configuration hierarchy this metric observes and belongs to. For instance, a metric can observe a service, a concrete service workload (pod or VM), a gateway, a workspace, or any other resource in the configuration hierarchy.	–
measure	tetrateio.api.tsb.observability.telemetry.v2.Measure Measure describes the name and unit of a metric event, also known as a measurement.	–
type	tetrateio.api.tsb.observability.telemetry.v2.MetricType The type of aggregation over time applied to the measurements.	–
origin	tetrateio.api.tsb.observability.telemetry.v2.MetricOrigin Where the metric measurements come from.	–
detectionPoint	tetrateio.api.tsb.observability.telemetry.v2.MetricDetectionPoint From which detection point the metric is observed, server side or client side. It is useful to differentiate between metrics that observe a concrete service (often self observing), or metrics that focus on service to service communications. In service to service observed metrics, the observation can be done at the client or the server side.	–

MetricType

Metric types are the aggregation functions applied to the measurements that took place over a period of time. Some metric types, like LABELED_COUNTER and PERCENTILE, are additionally aggregated over the set of defined labels.

Field	Description	Validation Rule
name	tetrateio.api.tsb.observability.telemetry.v2.MetricType.Type The type of metric	–
labels	List of tetrateio.api.tsb.observability.telemetry.v2.MetricType.Label The labels associated with the metric type. Some aggregation functions are not just applied over time. LABELED_COUNTER and PERCENTILE metric types also aggregate over their labels. For instance, a PERCENTILE metric type over the latency will aggregate the measured latency over the different defined percentiles: p50, p75, p90, p95, and p99.	–

Label

Label of a metric type. Labels can also be seen as additional dimensions of aggregation besides the time interval over which measurements are aggregated.

Field	Description	Validation Rule
key	string The label key.	–
value	string The label value, for instance p50, or p75.	–
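
To illustrate labels as an extra aggregation dimension, the following Python sketch counts measurements grouped by a label value, in the spirit of a LABELED_COUNTER over HTTP response codes. The `labeled_counter` helper and the sample response codes are hypothetical.

```python
from collections import Counter


def labeled_counter(measurements, label_of):
    """Count measurements grouped by the label value derived from each one."""
    return dict(Counter(label_of(m) for m in measurements))


response_codes = [200, 200, 404, 500, 200, 503]

# One label per exact response code.
by_code = labeled_counter(response_codes, label_of=str)
# One label per response class, such as 2xx or 5xx.
by_class = labeled_counter(response_codes, label_of=lambda c: f"{c // 100}xx")
print(by_code, by_class)
```

The same measurements produce different metrics depending on how label values are derived, which is why the labels are part of the metric type rather than of the measure.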

MeshControlledMeasureNames

The name of measures available for a controlled service in the mesh.

Field	Number	Description
INVALID_MEASURE_TYPE	0
COUNTABLE	1	Represents discrete instances of a countable quantity. An integer count of something SHOULD use the default unit, the unity. Countable is a generalized measure name that can be used for many common countable quantities. Because of the generalized name, annotations with curly braces are used to give additional meaning. Network packets and system paging faults are examples of countable measures.
REQUESTS	2	Requests is a specialized countable measure that represents the number of requests.
LATENCY	3	The time taken by each request.
STATUS	4	The success or failure of a request.
HTTP_RESPONSE_CODE	5	The response code of the HTTP response, if this request is an HTTP call. E.g. 200, 404, 302
RPC_RESPONSE_CODE	6	The value of the rpc response code.
SIDECAR_INTERNAL_ERROR_CODE	7	The sidecar/gateway proxy internal error code. The value is based on the implementation.
SIDECAR_RETRY_EXCEEDED	8	The sidecar/gateway proxy internal error code. The value is based on the implementation.
TCP_INFO_RECEIVED_BYTES	9	The received bytes of the TCP traffic, if this request is a TCP call.
TCP_INFO_SEND_BYTES	10	The sent bytes of the TCP traffic, if this request is a TCP call.
MTLS_IN_USE	11	If mutual tls is in use in the connections between services.
SIDECAR_HEAP_MEMORY_USED	12	Current reserved heap size in bytes. New Envoy process heap size on hot restart.
SIDECAR_MEMORY_ALLOCATED	14	Current amount of allocated memory in bytes. Total of both new and old Envoy processes on hot restart.
SIDECAR_PHYSICAL_MEMORY	15	Current estimate of total bytes of the physical memory. New Envoy process physical memory size on hot restart.
SIDECAR_TOTAL_CONNECTIONS	16	Total connections of both new and old Envoy processes.
SIDECAR_PARENT_CONNECTIONS	17	Total connections of the old Envoy process on hot restart.
SIDECAR_WORKER_THREADS	18	Number of worker threads.
SIDECAR_BUG_FAILURES	19	Number of Envoy bug failures detected in a release build. File or report the issue if this increments, as this may be serious.

MetricDetectionPoint

From which detection point the metric is observed.

Field	Number	Description
INVALID_METRIC_DETECTION_POINT	0
IN_SERVICE	1	Self-observability metrics use the in-service detect point.
CLIENT_SIDE	2	Client side is how the client is observing the metric. When service A calls service B, service A acts as a client side.
SERVER_SIDE	3	Server side is how the server is observing the metric. When service A calls service B, service B acts as the server side.

MetricOrigin

From where the metric measurements come from.

Field	Number	Description
INVALID_METRIC_ORIGIN	0
MESH_CONTROLLED	1	The metrics origin is from a TSB configured mesh, capturing the metrics from the sidecar's available observability.
AGENT_OBSERVED	2	An agent, which can be standalone or a service with automatic instrumentation via byte code injection. Currently not available. Part of hybrid observability.
MESH_IMPORTED	3	Other known mesh generated metrics that are not configured and handled by TSB. Currently not available. Part of hybrid observability.
EXTERNAL_IMPORTED	4	External captured metrics that are either imported into TSB observability stack or queried at runtime. Currently not available. Part of hybrid observability.

Type

Field	Number	Description
INVALID_METRIC_TYPE	0
GAUGE	1	Is the last seen measurement over a period of time.
COUNTER	2	Is the number of measurements over a period of time. Used in number of requests style of metrics.
AVERAGE	3	Average function applied to the measurements. Used in duration/latency style of metrics.
PERCENT	4	Percentage function applied to a given observed value over the total observed values. Used in SLA style of metrics, for example the percentage of errored responses over the total server responses.
APDEX	5	Application Performance Index (Apdex) score, which monitors end-user satisfaction.
HEATMAPS	6	Heat maps are a three dimensional visualization, using x and y coordinates for two dimensions, and color intensity for the third. They can reveal detail that summary statistics, such as line charts of averages, can miss. Latency measurements can be aggregated using heatmaps/histograms. One dimension is often time, the other is the latency, and the third one (the intensity) is the frequency of that latency in the given time range.
LABELED_COUNTER	7	Is the number of measurements over time, grouped by concrete label values. Used for counting responses by their HTTP response code, for instance.
PERCENTILE	8	This is a specific subtype of LABELED_COUNTER. Used in duration/latency style metrics.
CPM	10	Calls per minute. Used in requests per minute, or in 5xx HTTP errors per minute, 4xx HTTP errors per minute, among other metrics.
MAX	11	Selects the highest measurement over a period of time. Used in Envoy max allocated style metrics.
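
The simpler aggregation types in the table above can be sketched as plain Python functions over the measurements of one time window. This is only an illustration of the definitions, not how TSB computes them. The `apdex` function follows the standard Apdex formula (satisfied plus half of tolerating, over the total, with tolerating meaning up to four times the threshold); whether TSB uses the same thresholds is an assumption. All function names are hypothetical.

```python
def gauge(values):
    """GAUGE: the last seen measurement in the window."""
    return values[-1]


def counter(values):
    """COUNTER: how many measurements occurred in the window."""
    return len(values)


def average(values):
    """AVERAGE: arithmetic mean of the measurements."""
    return sum(values) / len(values)


def maximum(values):
    """MAX: the highest measurement in the window."""
    return max(values)


def percent(flags):
    """PERCENT: share of observations that match a condition, from 0 to 100."""
    return 100.0 * sum(1 for f in flags if f) / len(flags)


def apdex(latencies_ms, threshold_ms):
    """APDEX: standard Apdex score for a target latency threshold."""
    satisfied = sum(1 for v in latencies_ms if v <= threshold_ms)
    tolerating = sum(
        1 for v in latencies_ms if threshold_ms < v <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


window = [120.0, 80.0, 400.0, 2500.0]    # latencies in ms, in arrival order
succeeded = [True, True, True, False]    # per-request success flags

print(gauge(window), counter(window), average(window), maximum(window))
print(percent(succeeded), apdex(window, threshold_ms=100.0))
```

For this window the gauge is 2500.0, the counter is 4, the average is 775.0, the maximum is 2500.0, the success percentage is 75.0, and the Apdex score with a 100 ms threshold is 0.5.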
