Checks/tasks are failing intermittently for git and docker resources - containers have no internet access #2200

@pontusarfwedson

We are running Concourse 3.9 on Kubernetes using the official Helm chart, with 2 workers. For the past few hours our tasks and checks have been failing intermittently. When checking our git resources we often (but not always) get:

resource script '/opt/resource/check []' failed: exit status 128

stderr:
Identity added: /tmp/git-resource-private-key (/tmp/git-resource-private-key)
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

and sometimes also

resource script '/opt/resource/check []' failed: exit status 137
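As a side note on reading these two statuses: by shell convention an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL), while 128 is the status git uses when it aborts with a `fatal:` error. A minimal sketch of that decoding follows; the helper name `describe_exit_status` is hypothetical, and whether the kill here came from the OOM killer, a timeout, or the container being torn down is not something the status alone can tell.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe an exit status using the shell convention 128 + signal number."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {status}"


for status in (128, 137):
    print(status, "->", describe_exit_status(status))
```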

Inside our tasks (when they do get triggered) we sometimes see the same git resource check failure, but more often we get:


Error response from daemon: Get https://<our-docker-registry>: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Furthermore, some checks now appear stuck: they report "checked 45m 54s ago" even though they should run every minute.

Looking at the worker logs we can see things like

{"timestamp":"1525762694.813524485","source":"guardian","message":"guardian.api.garden-server.attach.failed","log_level":2,"data":{"error":"does not exist","handle":"7f35e2e7-dca2-48c5-5af0-9027a92cc53c","session":"3.1.15"}}

as well as

{"timestamp":"1525762730.415373564","source":"tsa","message":"tsa.connection.channel.forward-worker.register.start","log_level":1,"data":{"remote":"10.240.0.223:47690","session":"171.1.1.5","worker-address":"10.240.0.10:41187","worker-platform":"linux","worker-tags":""}}
{"timestamp":"1525762730.415424109","source":"guardian","message":"guardian.create-global-iptables-chains.create-started","log_level":1,"data":{"session":"2"}}
2018/05/08 06:58:50 failed to forward remote connection: dial tcp 127.0.0.1:7777: getsockopt: connection refused
{"timestamp":"1525762730.417540312","source":"tsa","message":"tsa.connection.channel.forward-worker.register.failed-to-fetch-containers","log_level":2,"data":{"error":"Get http://api/containers: EOF","remote":"10.240.0.223:47690","session":"171.1.1.5"}}
{"timestamp":"1525762730.420288801","source":"tsa","message":"tsa.connection.channel.forward-worker.register.failed-to-reach-worker","log_level":1,"data":{"baggageclaim-took":"2.558173ms","garden-took":"2.020758ms","remote":"10.240.0.223:47690","session":"171.1.1.5"}}
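To make log lines like these easier to scan, here is a minimal sketch that parses them, converts the epoch timestamps to UTC, and keeps only the higher-severity entries. It assumes the usual lager level numbering (0 debug, 1 info, 2 error, 3 fatal), which matches the lines above but is an assumption here; the function names `parse_line` and `errors_only` are hypothetical. Non-JSON lines, such as the plain `failed to forward remote connection` message, are skipped.

```python
import json
from datetime import datetime, timezone

# Assumed lager level numbering; not confirmed by the logs themselves.
LEVELS = {0: "debug", 1: "info", 2: "error", 3: "fatal"}


def parse_line(line: str):
    """Return (utc_time, level, source, message, data) or None for non-JSON lines."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict) or "timestamp" not in entry:
        return None
    when = datetime.fromtimestamp(float(entry["timestamp"]), tz=timezone.utc)
    level = LEVELS.get(entry.get("log_level"), "unknown")
    return (
        when.strftime("%Y-%m-%d %H:%M:%S"),
        level,
        entry.get("source", ""),
        entry.get("message", ""),
        entry.get("data", {}),
    )


def errors_only(lines):
    """Keep parsed entries whose level is error or fatal."""
    parsed = (parse_line(line) for line in lines)
    return [p for p in parsed if p is not None and p[1] in ("error", "fatal")]


sample = [
    '{"timestamp":"1525762694.813524485","source":"guardian","message":"guardian.api.garden-server.attach.failed","log_level":2,"data":{"error":"does not exist","handle":"7f35e2e7-dca2-48c5-5af0-9027a92cc53c","session":"3.1.15"}}',
    '{"timestamp":"1525762730.415424109","source":"guardian","message":"guardian.create-global-iptables-chains.create-started","log_level":1,"data":{"session":"2"}}',
    "2018/05/08 06:58:50 failed to forward remote connection: dial tcp 127.0.0.1:7777: getsockopt: connection refused",
]

for when, level, source, message, data in errors_only(sample):
    print(when, level, source, message, data.get("error", ""))
```

Lining up the converted timestamps with the plain-text line suggests the `guardian.create-global-iptables-chains.create-started` entry and the refused connection to 127.0.0.1:7777 happened in the same second, which may help when correlating events across components.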

We have also tried a pipeline that uses no credentials from Vault, but it hits the same issues.

Running fly -t target intercept -c pipeline/resource and then ping www.google.com inside the container resolves the IP address successfully, but no replies come back: ping reports 100% packet loss when interrupted.

bash-4.4# ping www.google.com
PING www.google.com (172.217.20.100): 56 data bytes
^C
--- www.google.com ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

traceroute also shows that packets do not get past the first hop:

bash-4.4# traceroute www.google.com
traceroute to www.google.com (172.217.20.100), 30 hops max, 46 byte packets
 1  10.254.0.209 (10.254.0.209)  0.014 ms  0.012 ms  0.010 ms
 2  *  *  *
 3  *  *^C
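Since ICMP can be filtered independently of TCP, a TCP-level probe can help confirm whether outbound connections from the container really fail, rather than just ping and traceroute. Below is a minimal sketch assuming Python is available in the intercepted container (it may not be in the stock resource images); the helper name `can_connect` is hypothetical, and the host and port in the commented usage are placeholders to be replaced with the actual git remote or registry.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (placeholder targets, not taken from the setup above):
#   print(can_connect("www.google.com", 443))
#   print(can_connect("<our-docker-registry>", 443))
```

If DNS resolves but every TCP probe times out, that would be consistent with the ping and traceroute output above and points at routing or NAT on the worker rather than at the resources themselves.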

Can this be tied to any known issues, and are there any suggestions for resolving it?
