Tetrate Service Bridge, Version: 1.5.x

Metric

A metric is a measurement about a service, captured at runtime. Logically, the moment of capturing one of these measurements is known as a metric event, which consists not only of the measurement itself but also of the time it was captured and its associated metadata.

The key aspects of a metric are the measure, the metric type, the metric origin, and the metric detect point:

  • The measure describes the type and unit of a metric event, also known as a measurement.
  • The metric type is the aggregation over time applied to the measurements.
  • The metric origin tells where the metric measurements come from.
  • The detect point is the point from which the metric is observed: in service, server side, or client side. It is useful for differentiating between metrics that observe a concrete service (often self-observing) and metrics that focus on service-to-service communication.
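As a rough illustration of how these aspects fit together, the following Python sketch models a metric event and a metric description as plain data. The class and field names are hypothetical and are not part of the TSB API; only the enum-like values are taken from this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class DetectPoint(Enum):
    """Detect points named on this page."""
    IN_SERVICE = 1
    CLIENT_SIDE = 2
    SERVER_SIDE = 3


@dataclass
class MetricEvent:
    """One captured measurement, when it was captured, and its metadata."""
    value: float
    timestamp: float  # seconds since the epoch (an assumption of this sketch)
    metadata: dict = field(default_factory=dict)


@dataclass
class MetricDescriptor:
    """The four key aspects of a metric, kept as simple values."""
    measure_type: str
    measure_unit: str
    metric_type: str
    origin: str
    detect_point: DetectPoint


# Hypothetical instances mirroring the requests-per-minute example on this page.
service_cpm = MetricDescriptor(
    "REQUESTS", "{request}", "CPM", "MESH_OBSERVED", DetectPoint.SERVER_SIDE
)
event = MetricEvent(
    value=1, timestamp=1_700_000_000.0, metadata={"service": "reviews.bookinfo"}
)
```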

A TSB-controlled service (one that is part of the mesh and has a proxy that TSB can configure) exposes several metrics that enable consistent monitoring across services. Some of them cover what is known as the RED metrics set, a set of very useful metrics for HTTP/RPC request-based services. RED stands for:

  • Rate (R): The number of requests per second.
  • Errors (E): The number of failed requests.
  • Duration (D): The amount of time to process a request.
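A minimal sketch of how the three RED values could be derived from raw request records, assuming each record carries a duration and a failure flag. The `Request` type and `red_summary` function are hypothetical helpers for illustration, not how TSB computes these metrics.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    failed: bool


def red_summary(requests, window_seconds):
    """Summarize rate, errors, and mean duration over one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.failed)
    mean_duration = sum(r.duration_ms for r in requests) / total if total else 0.0
    return {
        "rate_per_second": total / window_seconds,  # Rate (R)
        "errors": errors,                           # Errors (E)
        "mean_duration_ms": mean_duration,          # Duration (D)
    }


sample = [Request(10.0, False), Request(30.0, True), Request(20.0, False)]
summary = red_summary(sample, window_seconds=60)
```

Here the duration is summarized as a mean; percentiles are a common alternative when the latency distribution is skewed.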

To better understand which metrics are available for a given telemetry source, assume we have deployed the classic Istio bookinfo demo application. Let's look at some RED-based metrics available for a service observed and managed by TSB, for instance the reviews service, using the GLOBAL scoped telemetry source.

The following metric is the number of requests per minute that the reviews service is handling at a GLOBAL scope:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_cpm
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: REQUESTS
    unit: "{request}"
  metricType:
    type: CPM
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
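To make the CPM aggregation concrete, here is a small sketch that groups request timestamps into one-minute buckets and counts them. The function name and the use of plain second-based timestamps are assumptions of this example; TSB's own aggregation pipeline is not shown on this page.

```python
from collections import Counter


def calls_per_minute(timestamps):
    """Count request timestamps (in seconds) per one-minute bucket."""
    return dict(Counter(int(ts // 60) for ts in timestamps))


# Three requests in minute 0, two in minute 1, one in minute 2.
cpm = calls_per_minute([0.5, 12.0, 59.9, 60.0, 119.0, 125.0])
```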

The metric for the average duration of the requests handled by the reviews service at a GLOBAL scope:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_resp_time
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: LATENCY
    unit: ms
  metricType:
    type: AVERAGE
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE

The metric for the errors of the requests handled by the reviews service at a GLOBAL scope. In this case the number of errors is expressed as a percentage of the total number of handled requests:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_sla
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: STATUS
    unit: NUMBER
  metricType:
    type: PERCENT
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
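The PERCENT aggregation over STATUS measurements can be sketched as the share of observations that match a given value. In this illustrative helper, statuses are booleans where `True` means success; that encoding and the function name are assumptions, not TSB's representation.

```python
def percent(statuses, observed):
    """Percentage of entries in `statuses` that equal `observed`."""
    if not statuses:
        return 0.0
    matching = sum(1 for s in statuses if s == observed)
    return 100.0 * matching / len(statuses)


statuses = [True, True, False, True]  # one failed request out of four
error_percent = percent(statuses, observed=False)
success_percent = percent(statuses, observed=True)
```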

Using a different telemetry source for the same metric gives a different view of the same observed measurements. For instance, to know how many requests per minute subset v1 of the reviews service is handling, we use the same metric but from a different telemetry source, in this case reviews-v1:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews-v1
  name: service_cpm
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: REQUESTS
    unit: NUMBER
  metricType:
    type: CPM
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
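The idea that a telemetry source scopes the same measurements can be pictured as a filter applied before aggregation. The record layout and the `count_requests` helper below are hypothetical; how TSB actually resolves telemetry sources is not described here.

```python
def count_requests(records, subset=None):
    """Count requests, optionally restricted to one service subset."""
    return sum(1 for r in records if subset is None or r["subset"] == subset)


records = [
    {"subset": "v1"},
    {"subset": "v2"},
    {"subset": "v1"},
    {"subset": "v3"},
]

whole_service = count_requests(records)       # view comparable to source "reviews"
only_v1 = count_requests(records, "v1")       # view comparable to source "reviews-v1"
```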

Duration (latency) measurements can also be aggregated into different percentiles over time. The duration percentiles for the requests handled by the reviews service at a GLOBAL scope:

apiVersion: observability.telemetry.tsb.tetrate.io/v2
kind: Metric
metadata:
  organization: myorg
  service: reviews.bookinfo
  source: reviews
  name: service_percentile
spec:
  observedResource: organizations/myorg/services/reviews.bookinfo
  measure:
    type: LATENCY
    unit: ms
  metricType:
    type: PERCENTILE
    labels:
      - key: "0"
        value: "p50"
      - key: "1"
        value: "p75"
      - key: "2"
        value: "p90"
      - key: "3"
        value: "p95"
      - key: "4"
        value: "p99"
  origin: MESH_OBSERVED
  detectPoint: SERVER_SIDE
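As an illustration of aggregating latency over percentile labels, the sketch below uses the nearest-rank method, which is one common definition of a percentile. Whether TSB uses nearest-rank or another estimator is not stated on this page, so treat the method as an assumption; the label names follow the p50, p75, p90, p95, and p99 set mentioned in the MetricType description.

```python
import math


def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # synthetic latencies: 1 ms .. 100 ms
labels = {"p50": 50, "p75": 75, "p90": 90, "p95": 95, "p99": 99}
by_label = {
    label: nearest_rank_percentile(latencies_ms, p) for label, p in labels.items()
}
```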

Measure

A measure represents the name and unit of a measurement. For example, request latency in ms and the number of errors are measures that can be collected from a server. In the latency case, latency is the type and ms (milliseconds) is the unit.

| Field | Description | Validation Rule |
| --- | --- | --- |
| name | string. The name of the measure, for instance latency in ms. More reference values can be found at MeshControlledMeasureNames. | |
| unit | string. The unit of measure, which follows the Unified Code for Units of Measure (UCUM). COUNTABLE measures, such as the number of requests or network packets, SHOULD use the default unit, the unity, with annotations in curly braces to give additional meaning, for example {requests}, {packets}, {errors}, {faults}. | |
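A small sketch of building an annotated unit for a countable measure. The helper name is hypothetical, and rejecting braces and whitespace inside the annotation is a conservative assumption of this example rather than a rule quoted from this page.

```python
def annotated_unit(annotation):
    """Return the default unit (the unity) annotated in curly braces."""
    if not annotation or any(c in "{}" or c.isspace() for c in annotation):
        raise ValueError(f"invalid unit annotation: {annotation!r}")
    return "{" + annotation + "}"


units = [annotated_unit(name) for name in ("requests", "packets", "errors", "faults")]
```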

Metric

A metric is a measurement about a service, captured at runtime. Logically, the moment of capturing one of these measurements is known as a metric event which consists not only of the measurement itself, but the time that it was captured and associated metadata.

Application and request metrics are important indicators of availability and performance. Custom metrics can provide insights into how availability indicators impact user experience or the business. Collected data can be used to alert of an outage or trigger scheduling decisions to scale up a deployment automatically upon high demand.

| Field | Description | Validation Rule |
| --- | --- | --- |
| observedResource | string, OUTPUT_ONLY. Which concrete TSB resource in the configuration hierarchy this metric observes and belongs to. For instance, a metric can observe a service, a concrete service workload (pod or VM), a gateway, a workspace, or any other resource in the configuration hierarchy. | |
| measure | tetrateio.api.tsb.observability.telemetry.v2.Measure. Measure describes the name and unit of a metric event, also known as a measurement. | |
| type | tetrateio.api.tsb.observability.telemetry.v2.MetricType. The type of aggregation over time applied to the measurements. | |
| origin | tetrateio.api.tsb.observability.telemetry.v2.MetricOrigin. Where the metric measurements come from. | |
| detectionPoint | tetrateio.api.tsb.observability.telemetry.v2.MetricDetectionPoint. The detection point from which the metric is observed, server side or client side. It is useful for differentiating between metrics that observe a concrete service (often self-observing) and metrics that focus on service-to-service communication. For service-to-service metrics, the observation can be done at the client or the server side. | |

MetricType

Metric types are the aggregation functions applied to the measurements that took place over a period of time. Some metric types, like LABELED_COUNTER and PERCENTILE, are additionally aggregated over the set of defined labels.

| Field | Description | Validation Rule |
| --- | --- | --- |
| name | tetrateio.api.tsb.observability.telemetry.v2.MetricType.Type. The type of metric. | |
| labels | List of tetrateio.api.tsb.observability.telemetry.v2.MetricType.Label. The labels associated with the metric type. Some aggregation functions are not applied over time alone: the LABELED_COUNTER and PERCENTILE metric types also aggregate over their labels. For instance, a PERCENTILE metric type over latency aggregates the measured latency over the defined percentiles p50, p75, p90, p95, and p99. | |
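Aggregating over labels in addition to time can be sketched as a grouped count. The example below counts responses by HTTP response code, in the spirit of a LABELED_COUNTER; the `labeled_counter` helper and its `label_of` parameter are hypothetical names chosen for this illustration.

```python
from collections import Counter


def labeled_counter(measurements, label_of):
    """Count measurements grouped by the label derived from each one."""
    return dict(Counter(label_of(m) for m in measurements))


response_codes = [200, 200, 404, 500, 200, 302]
by_code = labeled_counter(response_codes, label_of=str)
```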

Label

Label of a metric type. Labels can also be seen as further dimensions of aggregation besides the time interval over which measurements are aggregated.

| Field | Description | Validation Rule |
| --- | --- | --- |
| key | string. The label key. | |
| value | string. The label value, for instance p50 or p75. | |

MeshControlledMeasureNames

The names of the measures available for a controlled service in the mesh.

| Field | Number | Description |
| --- | --- | --- |
| INVALID_MEASURE_TYPE | 0 | |
| COUNTABLE | 1 | Represents discrete instances of a countable quantity. An integer count of something SHOULD use the default unit, the unity. Countable is a generalized measure name that can be used for many common countable quantities. Because the name is generalized, use annotations in curly braces to give additional meaning. Network packets and system paging faults are examples of countable measures. |
| REQUESTS | 2 | Requests is a specialized countable measure that represents the number of requests. |
| LATENCY | 3 | The time taken by each request. |
| STATUS | 4 | The success or failure of a request. |
| HTTP_RESPONSE_CODE | 5 | The response code of the HTTP response, if this request is an HTTP call, e.g. 200, 404, 302. |
| RPC_RESPONSE_CODE | 6 | The value of the RPC response code. |
| SIDECAR_INTERNAL_ERROR_CODE | 7 | The sidecar/gateway proxy internal error code. The value is based on the implementation. |
| SIDECAR_RETRY_EXCEEDED | 8 | The sidecar/gateway proxy internal error code. The value is based on the implementation. |
| TCP_INFO_RECEIVED_BYTES | 9 | The received bytes of the TCP traffic, if this request is a TCP call. |
| TCP_INFO_SEND_BYTES | 10 | The sent bytes of the TCP traffic, if this request is a TCP call. |
| MTLS_IN_USE | 11 | Whether mutual TLS is in use in the connections between services. |
| SIDECAR_HEAP_MEMORY_USED | 12 | Current reserved heap size in bytes. New Envoy process heap size on hot restart. |
| SIDECAR_MEMORY_ALLOCATED | 14 | Current amount of allocated memory in bytes. Total of both new and old Envoy processes on hot restart. |
| SIDECAR_PHYSICAL_MEMORY | 15 | Current estimate of total bytes of the physical memory. New Envoy process physical memory size on hot restart. |
| SIDECAR_TOTAL_CONNECTIONS | 16 | Total connections of both new and old Envoy processes. |
| SIDECAR_PARENT_CONNECTIONS | 17 | Total connections of the old Envoy process on hot restart. |
| SIDECAR_WORKER_THREADS | 18 | Number of worker threads. |
| SIDECAR_BUG_FAILURES | 19 | Number of Envoy bug failures detected in a release build. File or report the issue if this increments, as this may be serious. |

MetricDetectionPoint

From which detection point the metric is observed.

| Field | Number | Description |
| --- | --- | --- |
| INVALID_METRIC_DETECTION_POINT | 0 | |
| IN_SERVICE | 1 | Self-observability metrics use the in-service detect point. |
| CLIENT_SIDE | 2 | Client side is how the client observes the metric. When service A calls service B, service A acts as the client side. |
| SERVER_SIDE | 3 | Server side is how the server observes the metric. When service A calls service B, service B acts as the server side. |
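The difference between the client-side and server-side detection points can be pictured with a single call from service A to service B that yields two observations. In this hypothetical sketch the client-side latency also includes network time, which is a common reason the two views differ; that modeling choice, and all names below, are assumptions of the example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    observer: str        # the service whose proxy recorded the measurement
    detect_point: str    # "CLIENT_SIDE" or "SERVER_SIDE"
    latency_ms: float


def observe_call(client, server, server_latency_ms, network_ms):
    """Return the two observations produced by one client-to-server call."""
    return [
        Observation(client, "CLIENT_SIDE", server_latency_ms + network_ms),
        Observation(server, "SERVER_SIDE", server_latency_ms),
    ]


# Service A calls service B: B takes 20 ms, the network round trip adds 5 ms.
client_view, server_view = observe_call("A", "B", server_latency_ms=20.0, network_ms=5.0)
```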

MetricOrigin

Where the metric measurements come from.

| Field | Number | Description |
| --- | --- | --- |
| INVALID_METRIC_ORIGIN | 0 | |
| MESH_CONTROLLED | 1 | The metric originates from a TSB-configured mesh and is captured through the sidecar's available observability. |
| AGENT_OBSERVED | 2 | An agent, which can be standalone or a service with automatic instrumentation via bytecode injection. Currently not available. Part of hybrid observability. |
| MESH_IMPORTED | 3 | Other known mesh-generated metrics that are not configured and handled by TSB. Currently not available. Part of hybrid observability. |
| EXTERNAL_IMPORTED | 4 | Externally captured metrics that are either imported into the TSB observability stack or queried at runtime. Currently not available. Part of hybrid observability. |

Type

| Field | Number | Description |
| --- | --- | --- |
| INVALID_METRIC_TYPE | 0 | |
| GAUGE | 1 | The last seen measurement over a period of time. |
| COUNTER | 2 | The sum of the number of measurements over a period of time. Used in number-of-requests style metrics. |
| AVERAGE | 3 | Average function applied to the measurements. Used in duration/latency style metrics. |
| PERCENT | 4 | Percentage function applied to a given observed value over the total observed values. Used in SLA style metrics, for example the percentage of errored responses over the total server responses. |
| APDEX | 5 | Application Performance Index (Apdex), which monitors end-user satisfaction and is expressed as an Apdex score. |
| HEATMAPS | 6 | Heat maps are a three-dimensional visualization, using x and y coordinates for two dimensions and color intensity for the third. They can reveal detail that summary statistics, such as line charts of averages, can miss. Latency measurements can be aggregated using heat maps/histograms. One dimension is often time, the other is the latency, and the third (the intensity) is the frequency of that latency in the given time range. |
| LABELED_COUNTER | 7 | The sum of the number of measurements over time, grouped by concrete label values. Used, for instance, for counting responses by their HTTP response code. |
| PERCENTILE | 8 | A specific subtype of LABELED_COUNTER. Used in duration/latency style metrics. |
| CPM | 10 | Calls per minute. Used in requests per minute, 5xx HTTP errors per minute, and 4xx HTTP errors per minute, among other metrics. |
| MAX | 11 | Selects the highest measurement over a period of time. Used in Envoy max-allocated style metrics. |
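A few of these aggregation types are simple enough to sketch directly. The functions below are illustrative only: COUNTER is read here as the sum of the measured values in the window, and APDEX uses the standard Apdex formula (satisfied plus half of tolerating, divided by total, with the tolerating band ending at four times the threshold). Whether TSB applies exactly these definitions or thresholds is not stated on this page.

```python
def gauge(values):
    """GAUGE: the last seen measurement in the period."""
    return values[-1]


def counter(values):
    """COUNTER: read here as the sum of the measurements in the period."""
    return sum(values)


def maximum(values):
    """MAX: the highest measurement in the period."""
    return max(values)


def apdex(latencies_ms, threshold_ms):
    """Standard Apdex score: (satisfied + tolerating / 2) / total."""
    satisfied = sum(1 for v in latencies_ms if v <= threshold_ms)
    tolerating = sum(
        1 for v in latencies_ms if threshold_ms < v <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


window = [100, 2500, 600, 1500, 200]  # latencies in ms within one period
score = apdex(window, threshold_ms=500)
```

With a 500 ms threshold, two requests are satisfied, two are tolerating, and one is frustrated, so the score is 0.6.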