Skip to content

Job failing on GCP but works on AWS #2057

@xtreme-peter-iskandar

Description

@xtreme-peter-iskandar

Bug Report

Issue

When we run our integration tests for our application it frequently (i.e., ~90% of the time) will fail due to what appears to be a networking issue.

Our application uses RSpec/Capybara/Headless Chrome in order to run the integration tests. As part of the tests, we call out to a script to seed the app with data via its API. The architecture from the perspective of the tests is as follows:

media-20180226

The failure mode appears to be that at some point, the API gateway loses the ability to talk to the microservices behind it. There is not an apparent pattern for when this happens - 10% of the time, the tests will complete without error. Sometimes, it will fail on the very first request to write seed data to the API. Other times, it will fail halfway through seeding data.

For example, the API gateway will see a request and map it correctly:

2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping      : Matching patterns for request [/v2/slots] are [/v2/slots/**]
2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping      : URI Template variables for request [/v2/slots] are {}
2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping      : Mapping [/v2/slots] to HandlerExecutionChain with handler [org.springframework.cloud.netflix.zuul.web.ZuulController@18cebaa5] and 1 interceptor
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : Finding route for path: /v2/slots
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : servletPath=/
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : zuulServletPath=/zuul
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : RequestUtils.isDispatcherServletRequest()=true
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : RequestUtils.isZuulServletRequest()=false
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : adjustedPath=/v2/slots
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : Matching pattern:/v2/slots/**
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : route matched=ZuulProperties.ZuulRoute(id=slots, path=/v2/slots/**, serviceId=null, url=http://127.0.0.1:9086, stripPrefix=false, retryable=null, sensitiveHeaders=[], customSensitiveHeaders=false)
2018-02-23 22:32:22.123 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.z.f.r.SimpleHostRoutingFilter    : 127.0.0.1 9086 http
2018-02-23 22:33:22.188  INFO 13229 --- [       Thread-7] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@e056f20: startup date [Fri Feb 23 22:30:02 UTC 2018]; parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@246ae04d

And the corresponding microservice will not see that request being made before it is terminated due to the test suite failing (note tthat he GET to /health is the only request received by the microservice):

2018-02-23 22:31:16.504  INFO 13556 --- [nio-9086-exec-1] i.p.m.timelines.api.RequestTimeLogger    : GET /health: 709 ms
2018-02-23 22:33:22.230  INFO 13556 --- [       Thread-6] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@a4102b8: startup date [Fri Feb 23 22:31:06 UTC 2018]; parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@735f7ae5
2018-02-23 22:33:22.235  INFO 13556 --- [       Thread-6] o.s.c.support.DefaultLifecycleProcessor  : Stopping beans in phase 0

The seeding script will then die with a read timeout:

...
Seeding slots: 2018-02-23 22:32:22 +0000
Creating slot for person 3 on project 1: 2018-02-23 22:32:22 +0000
rake aborted!
Faraday::TimeoutError: Net::ReadTimeout

We run all of the processes on the same container so all networking should be via the loopback interface.

We perform a GET request directly to /health endpoint on each of the microservices while they are booting, in order to determine that they have come up successfully and that the integration test can begin. This seems to work reliably.

What we tried

  • increased worker vm types (8cpu 30Gb)
  • updated ruby versions
  • increased logging on rake call (--trace)
    *ssh'd into the container and curl'd config server before databases were populated to make sure we could hit it from the VM
  • switch "localhost" to 127.0.0.1
  • increased the time out to 2 minutes
  • investigated the target pool and & forwarding port settings on GCP, didn't change anything
  • investigated firewall rules on GCP, nothing was blocking the ports or anything
  • set the priviledged flag to true on container
  • changed port being used to host failing server

Setup

  • Concourse version: 3.8.0
  • Deployment type (BOSH/Docker/binary): BOSH
  • Infrastructure/IaaS: GCP
  • Did this used to work? Works on AWS

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions