Troubleshooting for the Telemetry Module
Troubleshoot problems related to the Telemetry module and its pipelines.
No Data Arrive at the Backend
Symptom
- No data arrive at the backend.
- In the respective pipeline status, the TelemetryFlowHealthy condition has status GatewayAllTelemetryDataDropped or AgentAllTelemetryDataDropped.
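To inspect this condition, you can read the pipeline status with kubectl. The following is a minimal sketch that assumes a MetricPipeline named backend; adjust the kind and name to your pipeline:

```bash
# Print the TelemetryFlowHealthy condition of a pipeline (kind and name are placeholders)
kubectl get metricpipelines.telemetry.kyma-project.io backend \
  -o jsonpath='{.status.conditions[?(@.type=="TelemetryFlowHealthy")]}'
```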
Cause
- Authentication Error: The credentials configured in your pipeline output are incorrect.
- Network Unreachable: The backend URL is wrong, a firewall is blocking the connection, or there's a DNS issue preventing the agent or gateway from reaching the backend.
- Backend is Down: The observability backend itself is not running or is unhealthy.
Solution
- Identify the failing component:
  - If the status is GatewayAllTelemetryDataDropped, the problem is with the gateway.
  - If the status is AgentAllTelemetryDataDropped, the problem is with the agent.
- To check the failing component's logs, call kubectl logs -n kyma-system {POD_NAME}:
  - For the gateway, check Pod telemetry-(log|trace|metric)-gateway.
  - For the agent, check Pod telemetry-(log|metric)-agent.

  Look for errors related to authentication, connectivity, and DNS (see the example commands after this list).
- Check if the backend is up and reachable.
- Based on the log messages, fix the output section of your pipeline and re-apply it.
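For example, checking the metric signal could look like the following sketch; the Deployment and DaemonSet names are assumptions derived from the Pod name patterns above, so adjust them to the affected signal type:

```bash
# Gateway logs: look for authentication, connectivity, or DNS errors
kubectl logs -n kyma-system deployment/telemetry-metric-gateway | grep -iE "auth|connect|dns|timeout"

# Agent logs (the agents typically run as DaemonSets)
kubectl logs -n kyma-system daemonset/telemetry-metric-agent | grep -iE "auth|connect|dns|timeout"
```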
Not All Data Arrive at the Backend
Symptom
- The backend is reachable and the connection is properly configured, but some data points are refused.
- In the pipeline status, the TelemetryFlowHealthy condition has status GatewaySomeTelemetryDataDropped or AgentSomeTelemetryDataDropped.
Cause
This status indicates that the telemetry gateway or agent is successfully sending data, but the backend is rejecting some of it. Common reasons are:
- Rate Limiting: Your backend is rejecting requests because you're sending too much data at once.
- Invalid Data: Your backend is rejecting specific data due to incorrect formatting, invalid labels, or other schema violations.
Solution
- Check the error logs for the affected Pod by calling kubectl logs -n kyma-system {POD_NAME}:
  - For GatewaySomeTelemetryDataDropped, check Pod telemetry-(log|trace|metric)-gateway.
  - For AgentSomeTelemetryDataDropped, check Pod telemetry-(log|metric)-agent.
- Go to your observability backend and investigate potential causes.
- If the backend is limiting the rate by refusing data, try one of the following options:
  - Increase the ingestion rate of your backend (for example, by scaling out your SAP Cloud Logging instances).
  - Reduce the emitted data by reconfiguring the pipeline, for example, by disabling certain inputs or applying filters (see the sketch after this list).
  - Reduce the emitted data in your applications.
- Otherwise, fix the issues as indicated in the logs.
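For example, reducing the emitted data for a MetricPipeline could look like the following sketch; the pipeline name, namespace, endpoint, and input field names are assumptions, so verify them against the MetricPipeline reference of your module version:

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend                      # placeholder pipeline name
spec:
  input:
    istio:
      enabled: false                 # drop Istio metrics entirely
    prometheus:
      enabled: true
      namespaces:
        exclude:
          - load-test                # placeholder: skip noisy namespaces
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317   # placeholder endpoint
```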
Gateway Throttling
Symptom
In the pipeline status, the TelemetryFlowHealthy condition has status GatewayThrottling.
Cause
The gateway is receiving data faster than it can process and forward it.
Solution
Manually scale out the capacity by increasing the number of replicas for the affected gateway. For details, see Telemetry CRD.
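For example, setting a static replica count for the trace gateway in the Telemetry resource might look like the following sketch; the scaling fields are assumptions based on the Telemetry CRD, so verify them against the CRD reference of your version:

```yaml
apiVersion: operator.kyma-project.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: kyma-system
spec:
  trace:
    gateway:
      scaling:
        type: Static
        static:
          replicas: 3                # scale out the trace gateway
```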
Custom Spans Don’t Arrive at the Backend, but Istio Spans Do
Symptom
You see traces generated by the Istio service mesh, but traces from your own application code (custom spans) are missing.
Cause
The OpenTelemetry (OTel) SDK version used in your application is incompatible with the OTel Collector version.
Solution
- Check which SDK version you're using for instrumentation.
- Investigate whether it's compatible with the OTel Collector version.
- If necessary, upgrade to a supported SDK version.
Observability Backend Shows Fewer Traces than Expected
Symptom
The observability backend shows significantly fewer traces than the number of requests your application receives.
Cause
By default, Istio samples only 1% of requests for tracing to minimize performance overhead (see Configure Istio Tracing).
In low-traffic environments (such as development or test setups) or for low-traffic services, the request volume can be so low that a 1% sampling rate captures few or no traces at all.
Solution
To see more traces in the trace backend, increase the percentage of requests that are sampled (see Configure the Sampling Rate).
Alternatively, to trace a single request, force sampling by adding a traceparent HTTP header to your client request. This header contains a sampled flag that instructs the system to capture the trace, bypassing the global sampling rate (see Trace Context: Sampled Flag).
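For example, a client request that forces sampling could look like the following sketch; the URL is a placeholder and the trace and span IDs are arbitrary hex values:

```bash
# The trailing "01" is the sampled flag of the W3C traceparent header
curl -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" \
  https://my-app.example.com/orders   # placeholder URL
```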
MetricPipeline: Failed to Scrape Prometheus Endpoint
Symptom
Custom metrics don't arrive at the destination.
The OTel Collector produces log entries saying "Failed to scrape Prometheus endpoint", such as the following example:
```bash
2023-08-29T09:53:07.123Z warn internal/transaction.go:111 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus/app-pods", "data_type": "metrics", "scrape_timestamp": 1693302787120, "target_labels": "{__name__=\"up\", instance=\"10.42.0.18:8080\", job=\"app-pods\"}"}
```
Cause
There's a configuration or network issue between the metric agent and your application, such as:
- The Service that exposes your metrics port doesn't specify the application protocol.
- The workload is not configured to use STRICT mTLS mode, which the metric agent uses by default.
- A deny-all NetworkPolicy in your application's namespace prevents the agent from scraping metrics from annotated workloads.
Solution
Define the application protocol in the Service port definition, either by prefixing the port name with the protocol (for example, http-) or by setting the appProtocol attribute.
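For example, a Service that exposes the metrics port can declare the protocol in either way, as in the following sketch; the Service name, selector, and port are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-app-metrics           # placeholder
spec:
  selector:
    app.kubernetes.io/name: sample-app
  ports:
    - name: http-metrics             # protocol prefix in the port name ...
      appProtocol: http              # ... or the explicit appProtocol attribute
      port: 8080
      targetPort: 8080
```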
If the issue is with mTLS, either configure your workload to use STRICT mTLS, or switch to unencrypted scraping by adding the prometheus.io/scheme: "http" annotation to your workload.
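For example, switching a workload to unencrypted scraping could look like the following sketch; the Pod name and image are placeholders, and the prometheus.io/scrape and prometheus.io/port annotations are assumptions about a typically annotated workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-app                   # placeholder
  annotations:
    prometheus.io/scrape: "true"     # assumed to already be present on an annotated workload
    prometheus.io/port: "8080"
    prometheus.io/scheme: "http"     # scrape without mTLS
spec:
  containers:
    - name: app
      image: ghcr.io/example/sample-app:latest   # placeholder image
      ports:
        - containerPort: 8080
```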
Create a NetworkPolicy that explicitly allows ingress traffic from the metric agent, such as in the following example:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-traffic-from-agent
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: "annotated-workload" # <your workload here>
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kyma-system
          podSelector:
            matchLabels:
              telemetry.kyma-project.io/metric-scrape: "true"
  policyTypes:
    - Ingress
```
LogPipeline: Log Buffer Filling Up
Symptom
In the LogPipeline status, the TelemetryFlowHealthy condition has status AgentBufferFillingUp.
Cause
The backend ingestion rate is too low compared to the export rate of the log agent, causing data to accumulate in its buffer.
Solution
You can either increase the capacity of your backend or reduce the volume of log data being sent. Try one of the following options:
- Increase the ingestion rate of your backend (for example, by scaling out your SAP Cloud Logging instances).
- Reduce emitted data by re-configuring the pipeline, for example, by disabling certain inputs or applying namespace filters (see the sketch after this list).
- Reduce the amount of log data generated by your applications.
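For example, excluding a noisy namespace in a LogPipeline could look like the following sketch; the pipeline name, the namespace, and the input field names are assumptions, so verify them against the LogPipeline reference of your module version:

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: LogPipeline
metadata:
  name: backend                      # placeholder pipeline name
spec:
  input:
    application:
      namespaces:
        exclude:
          - chatty-namespace         # placeholder: drop logs from noisy namespaces
  # output section unchanged (omitted here)
```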