Troubleshooting for the Telemetry Module
Troubleshoot problems related to the Telemetry module and its pipelines.
No Data Arrive at the Backend
Symptom
- No data arrive at the backend.
- In the respective pipeline status, the TelemetryFlowHealthy condition has status GatewayAllTelemetryDataDropped or AgentAllTelemetryDataDropped.
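To inspect this condition, you can read the pipeline status with kubectl. The following is a minimal sketch that assumes a MetricPipeline named backend; adjust the kind and name to your pipeline:

```bash
# Print the TelemetryFlowHealthy condition of a pipeline (kind and name are placeholders)
kubectl get metricpipelines.telemetry.kyma-project.io backend \
  -o jsonpath='{.status.conditions[?(@.type=="TelemetryFlowHealthy")]}'
```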
Cause
- Authentication Error: The credentials configured in your pipeline output are incorrect.
- Network Unreachable: The backend URL is wrong, a firewall is blocking the connection, or there's a DNS issue preventing the agent or gateway from reaching the backend.
- Backend is Down: The observability backend itself is not running or is unhealthy.
Solution
- Identify the failing component:
  - If the status is GatewayAllTelemetryDataDropped, the problem is with the gateway.
  - If the status is AgentAllTelemetryDataDropped, the problem is with the agent.
- To check the failing component's logs, call kubectl logs -n kyma-system {POD_NAME}:
  - For the gateway, check Pod telemetry-(log|trace|metric)-gateway.
  - For the agent, check Pod telemetry-(log|metric)-agent.

  Look for errors related to authentication, connectivity, and DNS (see the example commands after this list).
- Check if the backend is up and reachable.
- Based on the log messages, fix the output section of your pipeline and re-apply it.
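For example, checking the metric signal could look like the following sketch; the Deployment and DaemonSet names are assumptions derived from the Pod name patterns above, so adjust them to the affected signal type:

```bash
# Gateway logs: look for authentication, connectivity, or DNS errors
kubectl logs -n kyma-system deployment/telemetry-metric-gateway | grep -iE "auth|connect|dns|timeout"

# Agent logs (the agents typically run as DaemonSets)
kubectl logs -n kyma-system daemonset/telemetry-metric-agent | grep -iE "auth|connect|dns|timeout"
```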
Not All Data Arrive at the Backend
Symptom
- The backend is reachable and the connection is properly configured, but some data points are refused.
- In the pipeline status, the TelemetryFlowHealthy condition has status GatewaySomeTelemetryDataDropped or AgentSomeTelemetryDataDropped.
Cause
This status indicates that the telemetry gateway or agent is successfully sending data, but the backend is rejecting some of it. Common reasons are:
- Rate Limiting: Your backend is rejecting requests because you're sending too much data at once.
- Invalid Data: Your backend is rejecting specific data due to incorrect formatting, invalid labels, or other schema violations.
Solution
- Check the error logs for the affected Pod by calling kubectl logs -n kyma-system {POD_NAME}:
  - For GatewaySomeTelemetryDataDropped, check Pod telemetry-(log|trace|metric)-gateway.
  - For AgentSomeTelemetryDataDropped, check Pod telemetry-(log|metric)-agent.
- Go to your observability backend and investigate potential causes.
- If the backend is limiting the rate by refusing data, try one of the following options:
  - Increase the ingestion rate of your backend (for example, by scaling out your SAP Cloud Logging instances).
  - Reduce the emitted data by reconfiguring the pipeline, for example, by disabling certain inputs or applying filters (see the sketch after this list).
  - Reduce the emitted data in your applications.
- Otherwise, fix the issues as indicated in the logs.
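For example, reducing the emitted data for a MetricPipeline could look like the following sketch; the pipeline name, namespace, endpoint, and input field names are assumptions, so verify them against the MetricPipeline reference of your module version:

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend                      # placeholder pipeline name
spec:
  input:
    istio:
      enabled: false                 # drop Istio metrics entirely
    prometheus:
      enabled: true
      namespaces:
        exclude:
          - load-test                # placeholder: skip noisy namespaces
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317   # placeholder endpoint
```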
Gateway Throttling
Symptom
In the pipeline status, the TelemetryFlowHealthy condition has status GatewayThrottling.
Cause
The gateway is receiving data faster than it can process and forward it.
Solution
Manually scale out the capacity by increasing the number of replicas for the affected gateway. For details, see Telemetry CRD.
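For example, setting a static replica count for the trace gateway in the Telemetry resource might look like the following sketch; the scaling fields are assumptions based on the Telemetry CRD, so verify them against the CRD reference of your version:

```yaml
apiVersion: operator.kyma-project.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: kyma-system
spec:
  trace:
    gateway:
      scaling:
        type: Static
        static:
          replicas: 3                # scale out the trace gateway
```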
Custom Spans Don’t Arrive at the Backend, but Istio Spans Do
Symptom
You see traces generated by the Istio service mesh, but traces from your own application code (custom spans) are missing.
Cause
The OpenTelemetry (OTel) SDK version used in your application is incompatible with the OTel Collector version.
Solution
- Check which SDK version you're using for instrumentation.
- Investigate whether it's compatible with the OTel Collector version.
- If necessary, upgrade to a supported SDK version.
Observability Backend Shows Fewer Traces than Expected
Symptom
The observability backend shows significantly fewer traces than the number of requests your application receives.
Cause
By default, Istio samples only 1% of requests for tracing to minimize performance overhead (see Configure Istio Tracing).
In low-traffic environments (such as development or test setups) or for low-traffic services, the request volume can be so low that a 1% sampling rate captures few or no traces at all.
Solution
To see more traces in the trace backend, increase the percentage of requests that are sampled (see Configure the Sampling Rate).
Alternatively, to trace a single request, force sampling by adding a traceparent HTTP header to your client request. This header contains a sampled flag that instructs the system to capture the trace, bypassing the global sampling rate (see Trace Context: Sampled Flag).
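For example, a client request that forces sampling could look like the following sketch; the URL is a placeholder and the trace and span IDs are arbitrary hex values:

```bash
# The trailing "01" is the sampled flag of the W3C traceparent header
curl -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" \
  https://my-app.example.com/orders   # placeholder URL
```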
MetricPipeline: Failed to Scrape Prometheus Endpoint
Symptom
Custom metrics don't arrive at the destination.
The OTel Collector produces log entries saying "Failed to scrape Prometheus endpoint", such as the following example:
```bash
2023-08-29T09:53:07.123Z warn internal/transaction.go:111 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus/app-pods", "data_type": "metrics", "scrape_timestamp": 1693302787120, "target_labels": "{__name__=\"up\", instance=\"10.42.0.18:8080\", job=\"app-pods\"}"}
```
Cause
There's a configuration or network issue between the metric agent and your application, such as:
- The Service that exposes your metrics port doesn't specify the application protocol.
- The workload is not configured to use STRICT mTLS mode, which the metric agent uses by default.
- A deny-all NetworkPolicy in your application's namespace prevents the agent from scraping metrics from annotated workloads.
Solution
Define the application protocol in the Service port definition, either by prefixing the port name with the protocol (for example, http-) or by setting the appProtocol attribute.
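For example, a Service that exposes the metrics port can declare the protocol in either way, as in the following sketch; the Service name, selector, and port are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-app-metrics           # placeholder
spec:
  selector:
    app.kubernetes.io/name: sample-app
  ports:
    - name: http-metrics             # protocol prefix in the port name ...
      appProtocol: http              # ... or the explicit appProtocol attribute
      port: 8080
      targetPort: 8080
```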
If the issue is with mTLS, either configure your workload to use STRICT mTLS, or switch to unencrypted scraping by adding the prometheus.io/scheme: "http" annotation to your workload.
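For example, switching a workload to unencrypted scraping could look like the following sketch; the Pod name and image are placeholders, and the prometheus.io/scrape and prometheus.io/port annotations are assumptions about a typically annotated workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-app                   # placeholder
  annotations:
    prometheus.io/scrape: "true"     # assumed to already be present on an annotated workload
    prometheus.io/port: "8080"
    prometheus.io/scheme: "http"     # scrape without mTLS
spec:
  containers:
    - name: app
      image: ghcr.io/example/sample-app:latest   # placeholder image
      ports:
        - containerPort: 8080
```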
Create a NetworkPolicy that explicitly allows ingress traffic from the metric agent, such as in the following example:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-traffic-from-agent
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: "annotated-workload" # <your workload here>
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kyma-system
          podSelector:
            matchLabels:
              telemetry.kyma-project.io/metric-scrape: "true"
  policyTypes:
    - Ingress
```
LogPipeline: Log Buffer Filling Up
Symptom
In the LogPipeline status, the TelemetryFlowHealthy condition has status AgentBufferFillingUp.
Cause
The backend ingestion rate is too low compared to the export rate of the log agent, causing data to accumulate in its buffer.
Solution
You can either increase the capacity of your backend or reduce the volume of log data being sent. Try one of the following options:
- Increase the ingestion rate of your backend (for example, by scaling out your SAP Cloud Logging instances).
- Reduce emitted data by re-configuring the pipeline, for example, by disabling certain inputs or applying namespace filters (see the sketch after this list).
- Reduce the amount of log data generated by your applications.
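For example, excluding a noisy namespace in a LogPipeline could look like the following sketch; the pipeline name, the namespace, and the input field names are assumptions, so verify them against the LogPipeline reference of your module version:

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: LogPipeline
metadata:
  name: backend                      # placeholder pipeline name
spec:
  input:
    application:
      namespaces:
        exclude:
          - chatty-namespace         # placeholder: drop logs from noisy namespaces
  # output section unchanged (omitted here)
```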