-
-
Notifications
You must be signed in to change notification settings - Fork 867
Description
Bug Report
Issue
When we run our integration tests for our application it frequently (i.e., ~90% of the time) will fail due to what appears to be a networking issue.
Our application uses RSpec/Capybara/Headless Chrome in order to run the integration tests. As part of the tests, we call out to a script to seed the app with data via its API. The architecture from the perspective of the tests is as follows:
The failure mode appears to be that at some point, the API gateway loses the ability to talk to the microservices behind it. There is not an apparent pattern for when this happens - 10% of the time, the tests will complete without error. Sometimes, it will fail on the very first request to write seed data to the API. Other times, it will fail halfway through seeding data.
For example, the API gateway will see a request and map it correctly:
2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping : Matching patterns for request [/v2/slots] are [/v2/slots/**]
2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping : URI Template variables for request [/v2/slots] are {}
2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping : Mapping [/v2/slots] to HandlerExecutionChain with handler [org.springframework.cloud.netflix.zuul.web.ZuulController@18cebaa5] and 1 interceptor
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator : Finding route for path: /v2/slots
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator : servletPath=/
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator : zuulServletPath=/zuul
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator : RequestUtils.isDispatcherServletRequest()=true
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator : RequestUtils.isZuulServletRequest()=false
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator : adjustedPath=/v2/slots
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator : Matching pattern:/v2/slots/**
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator : route matched=ZuulProperties.ZuulRoute(id=slots, path=/v2/slots/**, serviceId=null, url=http://127.0.0.1:9086, stripPrefix=false, retryable=null, sensitiveHeaders=[], customSensitiveHeaders=false)
2018-02-23 22:32:22.123 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.z.f.r.SimpleHostRoutingFilter : 127.0.0.1 9086 http
2018-02-23 22:33:22.188 INFO 13229 --- [ Thread-7] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@e056f20: startup date [Fri Feb 23 22:30:02 UTC 2018]; parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@246ae04d
And the corresponding microservice will not see that request being made before it is terminated due to the test suite failing (note tthat he GET to /health is the only request received by the microservice):
2018-02-23 22:31:16.504 INFO 13556 --- [nio-9086-exec-1] i.p.m.timelines.api.RequestTimeLogger : GET /health: 709 ms
2018-02-23 22:33:22.230 INFO 13556 --- [ Thread-6] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@a4102b8: startup date [Fri Feb 23 22:31:06 UTC 2018]; parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@735f7ae5
2018-02-23 22:33:22.235 INFO 13556 --- [ Thread-6] o.s.c.support.DefaultLifecycleProcessor : Stopping beans in phase 0
The seeding script will then die with a read timeout:
...
Seeding slots: 2018-02-23 22:32:22 +0000
Creating slot for person 3 on project 1: 2018-02-23 22:32:22 +0000
rake aborted!
Faraday::TimeoutError: Net::ReadTimeout
We run all of the processes on the same container so all networking should be via the loopback interface.
We perform a GET request directly to /health endpoint on each of the microservices while they are booting, in order to determine that they have come up successfully and that the integration test can begin. This seems to work reliably.
What we tried
- increased worker vm types (8cpu 30Gb)
- updated ruby versions
- increased logging on rake call (--trace)
*ssh'd into the container and curl'd config server before databases were populated to make sure we could hit it from the VM - switch "localhost" to 127.0.0.1
- increased the time out to 2 minutes
- investigated the target pool and & forwarding port settings on GCP, didn't change anything
- investigated firewall rules on GCP, nothing was blocking the ports or anything
- set the priviledged flag to true on container
- changed port being used to host failing server
Setup
- Concourse version: 3.8.0
- Deployment type (BOSH/Docker/binary): BOSH
- Infrastructure/IaaS: GCP
- Did this used to work? Works on AWS