Job failing on GCP but works on AWS

# Bug Report

## Issue

When we run our integration tests for our application it frequently (i.e., ~90% of the time) will fail due to what appears to be a networking issue.

Our application uses RSpec/Capybara/Headless Chrome in order to run the integration tests. As part of the tests, we call out to a script to seed the app with data via its API. The architecture from the perspective of the tests is as follows:

![media-20180226](https://user-images.githubusercontent.com/1062824/36677592-9290fb00-1adc-11e8-85c1-b19d33bd5aeb.png)

The failure mode appears to be that at some point, the API gateway loses the ability to talk to the microservices behind it. There is not an apparent pattern for when this happens - 10% of the time, the tests will complete without error. Sometimes, it will fail on the very first request to write seed data to the API. Other times, it will fail halfway through seeding data. 

For example, the API gateway will see a request and map it correctly:

```
2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping      : Matching patterns for request [/v2/slots] are [/v2/slots/**]
2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping      : URI Template variables for request [/v2/slots] are {}
2018-02-23 22:32:22.108 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.web.ZuulHandlerMapping      : Mapping [/v2/slots] to HandlerExecutionChain with handler [org.springframework.cloud.netflix.zuul.web.ZuulController@18cebaa5] and 1 interceptor
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : Finding route for path: /v2/slots
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : servletPath=/
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : zuulServletPath=/zuul
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : RequestUtils.isDispatcherServletRequest()=true
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : RequestUtils.isZuulServletRequest()=false
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : adjustedPath=/v2/slots
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : Matching pattern:/v2/slots/**
2018-02-23 22:32:22.122 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.zuul.filters.SimpleRouteLocator  : route matched=ZuulProperties.ZuulRoute(id=slots, path=/v2/slots/**, serviceId=null, url=http://127.0.0.1:9086, stripPrefix=false, retryable=null, sensitiveHeaders=[], customSensitiveHeaders=false)
2018-02-23 22:32:22.123 DEBUG 13229 --- [nio-9084-exec-8] o.s.c.n.z.f.r.SimpleHostRoutingFilter    : 127.0.0.1 9086 http
2018-02-23 22:33:22.188  INFO 13229 --- [       Thread-7] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@e056f20: startup date [Fri Feb 23 22:30:02 UTC 2018]; parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@246ae04d
```

And the corresponding microservice will not see that request being made before it is terminated due to the test suite failing (note tthat he GET to /health is the only request received by the microservice):

```
2018-02-23 22:31:16.504  INFO 13556 --- [nio-9086-exec-1] i.p.m.timelines.api.RequestTimeLogger    : GET /health: 709 ms
2018-02-23 22:33:22.230  INFO 13556 --- [       Thread-6] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@a4102b8: startup date [Fri Feb 23 22:31:06 UTC 2018]; parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@735f7ae5
2018-02-23 22:33:22.235  INFO 13556 --- [       Thread-6] o.s.c.support.DefaultLifecycleProcessor  : Stopping beans in phase 0
```
The seeding script will then die with a read timeout:

```
...
Seeding slots: 2018-02-23 22:32:22 +0000
Creating slot for person 3 on project 1: 2018-02-23 22:32:22 +0000
rake aborted!
Faraday::TimeoutError: Net::ReadTimeout
```

We run all of the processes on the same container so all networking should be via the loopback interface.

We perform a GET request directly to /health endpoint on each of the microservices while they are booting, in order to determine that they have come up successfully and that the integration test can begin. This seems to work reliably.

## What we tried
* increased worker vm types (8cpu 30Gb)
* updated ruby versions
* increased logging on rake call (--trace)
*ssh'd into the container and curl'd config server before databases were populated to make sure we could hit it from the VM
* switch "localhost" to 127.0.0.1
* increased the time out to 2 minutes
* investigated the target pool and & forwarding port settings on GCP, didn't change anything
* investigated firewall rules on GCP, nothing was blocking the ports or anything
* set the priviledged flag to true on container
* changed port being used to host failing server

## Setup
* Concourse version: 3.8.0
* Deployment type (BOSH/Docker/binary): BOSH
* Infrastructure/IaaS: GCP
* Did this used to work? Works on AWS


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Job failing on GCP but works on AWS #2057

Bug Report

Issue

What we tried

Setup

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Job failing on GCP but works on AWS #2057

Description

Bug Report

Issue

What we tried

Setup

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions