Description
We are running Concourse 3.9 on Kubernetes using the official helm chart, with 2 workers. For the past few hours, our tasks and checks have been failing intermittently. When checking our git resources, we often (but not always) get:
resource script '/opt/resource/check []' failed: exit status 128
stderr:
Identity added: /tmp/git-resource-private-key (/tmp/git-resource-private-key)
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
and sometimes also:
resource script '/opt/resource/check []' failed: exit status 137
Inside our tasks (when they do manage to get triggered), we can see the same git resource check failure, but we often also get:
Error response from daemon: Get https://<our-docker-registry>: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Furthermore, some checks now appear to be stuck: they report "checked 45m 54s ago" even though they should run every minute.
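For reference, a stuck check can in principle be forced manually with something like the following (target, pipeline, and resource names are placeholders):
# force an immediate check of a single resource (names are placeholders)
fly -t target check-resource -r pipeline/resource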
Looking at the worker logs, we can see things like:
{"timestamp":"1525762694.813524485","source":"guardian","message":"guardian.api.garden-server.attach.failed","log_level":2,"data":{"error":"does not exist","handle":"7f35e2e7-dca2-48c5-5af0-9027a92cc53c","session":"3.1.15"}}
as well as
{"timestamp":"1525762730.415373564","source":"tsa","message":"tsa.connection.channel.forward-worker.register.start","log_level":1,"data":{"remote":"10.240.0.223:47690","session":"171.1.1.5","worker-address":"10.240.0.10:41187","worker-platform":"linux","worker-tags":""}}
{"timestamp":"1525762730.415424109","source":"guardian","message":"guardian.create-global-iptables-chains.create-started","log_level":1,"data":{"session":"2"}}
2018/05/08 06:58:50 failed to forward remote connection: dial tcp 127.0.0.1:7777: getsockopt: connection refused
{"timestamp":"1525762730.417540312","source":"tsa","message":"tsa.connection.channel.forward-worker.register.failed-to-fetch-containers","log_level":2,"data":{"error":"Get http://api/containers: EOF","remote":"10.240.0.223:47690","session":"171.1.1.5"}}
{"timestamp":"1525762730.420288801","source":"tsa","message":"tsa.connection.channel.forward-worker.register.failed-to-reach-worker","log_level":1,"data":{"baggageclaim-took":"2.558173ms","garden-took":"2.020758ms","remote":"10.240.0.223:47690","session":"171.1.1.5"}}
We have also tried a pipeline that uses no credentials from Vault, but it runs into the same issues.
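That Vault-free test pipeline is roughly of this shape (a sketch only; the resource name, URI, and pipeline name are placeholders, not our actual config):
# minimal pipeline with a single public git resource and no credentials (placeholders)
cat > no-vault-test.yml <<'EOF'
resources:
- name: repo
  type: git
  source:
    uri: https://github.com/concourse/git-resource.git
jobs:
- name: just-get
  plan:
  - get: repo
    trigger: true
EOF
fly -t target set-pipeline -p no-vault-test -c no-vault-test.yml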
Running fly -t target intercept -c pipeline/resource
and then doing ping www.google.com
successfully resolves the IP address, but no packets actually get through; after interrupting, ping reports 100% packet loss:
bash-4.4# ping www.google.com
PING www.google.com (172.217.20.100): 56 data bytes
^C
--- www.google.com ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
Furthermore, traceroute shows that traffic doesn't get past the first hop:
bash-4.4# traceroute www.google.com
traceroute to www.google.com (172.217.20.100), 30 hops max, 46 byte packets
1 10.254.0.209 (10.254.0.209) 0.014 ms 0.012 ms 0.010 ms
2 * * *
3 * *^C
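For completeness, the cluster deployment itself was done with the official chart, roughly as follows (a sketch; the worker.replicas value name is an assumption based on the chart's documented values, and the real install uses a fuller values file):
# helm 2 style install of the official chart (value name is an assumption)
helm install stable/concourse --name concourse --set worker.replicas=2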
Can this be tied to any known issues, and are there any suggestions for resolving it?