Envoy Proxy β Persistent Upstream Service Failures #111
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
π Envoy Proxy - Persistent Upstream Service Failures
This rule detects when the Envoy proxy repeatedly fails to connect with its upstream (backend) services. It identifies critical service outages by monitoring Envoy's access logs for specific failure patterns.
Why It Matters
When Envoy can't reach its backends, it results in HTTP
503
(Service Unavailable) and504
(Gateway Timeout) errors for the end-user. These failures directly impact application reliability, degrade user experience, and can trigger cascading failures across other microservices.Key Failure Indicators
This rule looks for the following signals in Envoy's access logs:
503
and504
Response Codes: A high volume of these errors indicates Envoy cannot get a valid response from a backend service.UH
: No healthy upstream host was available to serve the request.UT
: The request to the upstream service timed out.UO
: The request was rejected because the upstream cluster's circuit breaker was tripped (overflow).UF
: The request was rejected because of the upstream connection failure.Common Causes
These failures often point to underlying issues with the backend services, such as:
π Reproduction Steps
You can simulate this failure scenario to test the detection rule using the provided test environment.
Clone the test repository:
cd cre-envoy/
Make the test script executable:
Run the failure simulation script:
This script will start Envoy and a backend service, then simulate a failure condition.
Execute the detection rule:
Run the rule against the generated log file to confirm the detection.
cat logs/test.log | preq -r envoy-upstream-failure.yaml -d
Reproducible Setup (Maintainers invited): [cre-envoy]
Live CRE Detection: [CRE Playground]
Video Link
closes #97
/claim #97