
@rvhost rvhost commented Jul 6, 2025

🛑 Envoy Proxy - Persistent Upstream Service Failures

This rule detects when the Envoy proxy repeatedly fails to connect to its upstream (backend) services. It identifies critical service outages by monitoring Envoy's access logs for specific failure patterns.
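As a rough illustration of what "repeatedly fails" can mean in practice, the sketch below flags a failure as persistent once several failures land inside a short time window. The window length, the failure count, and the function name are hypothetical values chosen for the example; they are not taken from the rule in this PR.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical thresholds for illustration; the rule's real values may differ.
WINDOW = timedelta(seconds=10)
MIN_FAILURES = 3


def is_persistent(failure_times, window=WINDOW, min_failures=MIN_FAILURES):
    """Return True if at least min_failures failures fall within one window."""
    recent = deque()
    for ts in sorted(failure_times):
        recent.append(ts)
        # Drop failures that are older than the window relative to the newest one.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= min_failures:
            return True
    return False


base = datetime(2025, 7, 6, 12, 0, 0)
burst = [base + timedelta(seconds=s) for s in (0, 2, 4)]
sparse = [base + timedelta(seconds=s) for s in (0, 30, 60)]
print(is_persistent(burst))   # True
print(is_persistent(sparse))  # False
```

Counting within a window, rather than reacting to a single error, is one way to separate a sustained outage from an isolated failed request.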

Why It Matters

When Envoy can't reach its backends, end users receive HTTP 503 (Service Unavailable) and 504 (Gateway Timeout) errors. These failures directly impact application reliability, degrade user experience, and can trigger cascading failures across other microservices.


Key Failure Indicators

This rule looks for the following signals in Envoy's access logs:

  • HTTP 503 and 504 Response Codes: A high volume of these errors indicates Envoy cannot get a valid response from a backend service.
  • Response Flags:
    • UH: No healthy upstream host was available to serve the request.
    • UT: The request to the upstream service timed out.
    • UO: The request was rejected because the upstream cluster's circuit breaker was tripped (overflow).
    • UF: The request failed because the connection to the upstream host could not be established.
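The signals above can be pulled out of a log line with a small parser. This is a minimal sketch that assumes Envoy's default access log format, where the quoted request line is followed by the response code and then the response flags; a custom log format would need a different pattern. The sample lines and the function name are made up for the example.

```python
import re

# Assumes Envoy's default access log format:
# [start time] "METHOD PATH PROTOCOL" RESPONSE_CODE RESPONSE_FLAGS ...
LINE_RE = re.compile(
    r'^\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<code>\d{3}) (?P<flags>\S+) '
)

FAILURE_CODES = {503, 504}
FAILURE_FLAGS = {"UH", "UT", "UO", "UF"}


def upstream_failure(line):
    """Return (code, upstream_flags) for a 503/504 line, else None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    code = int(m.group("code"))
    if code not in FAILURE_CODES:
        return None
    # Multiple flags are comma-separated; "-" means no flag was set.
    flags = set(m.group("flags").split(",")) & FAILURE_FLAGS
    return code, sorted(flags)


# Made-up sample lines in the default format.
fail_line = '[2025-07-06T12:00:00.000Z] "GET /api HTTP/1.1" 503 UH 0 19 0 - "-" "curl/8.0" "req-1" "localhost:10000" "-"'
ok_line = '[2025-07-06T12:00:01.000Z] "GET /api HTTP/1.1" 200 - 0 12 3 2 "-" "curl/8.0" "req-2" "localhost:10000" "127.0.0.1:8080"'

print(upstream_failure(fail_line))  # (503, ['UH'])
print(upstream_failure(ok_line))    # None
```

The sketch keys on the response code first and reports the flags as supporting detail; whether the actual rule combines the two signals this way is an assumption.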

Common Causes

These failures often point to underlying issues with the backend services, such as:

  • Service Unavailability: The backend service has crashed, is not running, or is misconfigured.
  • Resource Exhaustion: The upstream service is overloaded (high CPU, memory, or connection limits) and cannot accept new requests.
  • Circuit Breaker Trips: The upstream cluster has hit a configured limit (such as maximum connections, pending requests, or retries), so Envoy is intentionally rejecting additional requests to protect the system.
  • Network Issues: Problems with DNS or network connectivity are preventing Envoy from reaching the service.
  • Misconfigured Health Checks: The health checks are incorrectly marking a healthy service as unavailable.

📋 Reproduction Steps

You can simulate this failure scenario to test the detection rule using the provided test environment.

  1. Clone the test repository:

    cd cre-envoy/
  2. Make the test script executable:

    chmod +x test.sh
  3. Run the failure simulation script:
    This script will start Envoy and a backend service, then simulate a failure condition.

    ./test.sh
  4. Execute the detection rule:
    Run the rule against the generated log file to confirm the detection.

    cat logs/test.log | preq -r envoy-upstream-failure.yaml -d

Reproducible Setup (Maintainers invited): [cre-envoy]
Live CRE Detection: [CRE Playground]


Video Link

closes #97
/claim #97

rvhost commented Jul 10, 2025

@tonymeehan this is my first PR on the CRE repo, so I just wanted to give some context. I spent 5 days really digging into Envoy and different scenarios. It was a serious deep dive, but I think I got there.

I would really appreciate an in-depth review of the PR when you can, so I can learn what to do better. Thanks for the chance!

The rule now targets only the response code key space, not the entire log line, so that we don't get any false positives.
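To illustrate why scoping the match matters, here is a small sketch comparing a whole-line search with a check on the response code field only. It assumes Envoy's default access log format; the sample line and both helper functions are hypothetical and are not the rule's actual matching logic.

```python
import re

# Made-up sample: a successful request whose path and byte count
# both happen to contain "503".
ok_line = '[2025-07-10T18:00:00.000Z] "GET /orders/503 HTTP/1.1" 200 - 0 503 4 3 "-" "curl/8.0" "req-1" "shop.local" "10.0.0.5:8080"'


def matches_anywhere(line):
    """Naive check: search the entire line for the failure codes."""
    return re.search(r"\b50[34]\b", line) is not None


# Scoped check: read only the token right after the quoted request line,
# which is the response code in the default format.
CODE_RE = re.compile(r'^\[[^\]]+\] "[^"]*" (\d{3}) ')


def matches_response_code(line):
    m = CODE_RE.match(line)
    return m is not None and m.group(1) in {"503", "504"}


print(matches_anywhere(ok_line))       # True (a false positive)
print(matches_response_code(ok_line))  # False
```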
@rvhost rvhost requested a review from tonymeehan July 10, 2025 18:22
@tonymeehan tonymeehan merged commit d3b0b55 into prequel-dev:main Jul 10, 2025
2 checks passed
Development

Successfully merging this pull request may close these issues.

[Multiple Winners] Envoy: Reproduce A High-Severity Failure & Write a Detection Rule [Submit by July 6 11:59 pm ET]