Configuration Assistance Request
Note: below I refer to "ATC" and "TSA" interchangeably, since I'm running the binaries.
When configuring two ATCs with the `--peer-url` web command option, I'm getting the following errors when a worker attempts to register or forward.
From atc1 (and also forwarded to worker logs):
{"timestamp":"...","source":"tsa","message":"tsa.connection.channel.forward-worker.register.failed-to-fetch-containers","log_level":2,"data":{"error":"Get http://api/containers: dial tcp <ip-of-atc2>:38775: getsockopt: connection refused","remote":"<ip-of-proxy>:52066","session":"1.1.1.5"}}
{"timestamp":"...","source":"tsa","message":"tsa.connection.channel.forward-worker.register.failed-to-list-volumes","log_level":2,"data":{"error":"Get http://concourse-atc2:33366/volumes: dial tcp <ip-of-atc2>:33366: getsockopt: connection refused","remote":"<ip-of-proxy>:52066","session":"1.1.1.5"}}
Some Facts
- Using haproxy to round-robin TSAs
- The two ATCs, the proxy, and workers live in the same private network, each on their own AWS EC2 instance
- Every `*-bind-ip` option available to the `web` and `worker` commands is set to `0.0.0.0`. Not sure if this is necessary; I'm just trying to open everything up to get things working.
- The TSAs are listening on the default port: 2222
- I'm running several separate Concourse installations through one proxy instance, differentiated by TSA port number. For example, the proxy listens on port 2224 and forwards to the default 2222 on the TSA host.
- Host names are mapped correctly, according to the above error messages (which show both the host name and the correct IP)
- When both ATCs are brought up and peered, and no worker is attempting to register/forward, no errors occur, which I think tells me that the `--peer-url` values are correct and the ATCs can reach each other over 8080
- I can prove that atc1 has connectivity to atc2, and vice versa, using `telnet` to connect and `nc` to listen on an example ephemeral port from one of the above error messages (see the sketch after this list)
- If relevant, I can prove the same connectivity between the proxy, the ATCs/TSAs, and the workers in all directions
- When I remove the `--peer-url` option and run only one ATC/TSA, the worker registers/forwards successfully and shows up in `fly workers`, which tells me that the proxy config and SSH keys are correct
- When I remove the proxy from the equation entirely (again running two ATCs/TSAs, peered), point the `--external-url`s at atc1 (not the proxy), and point the worker directly at either of the TSAs, the same errors occur (except for the difference in the "remote" value). This suggests I've done something fundamentally wrong, as multiple ATCs/TSAs should provide redundancy so that service is upheld when one is unavailable, but that's not happening.
- When I run without any proxy and only one ATC/TSA, the worker shows up in `fly workers`
- Between changing any Concourse command options to perform the above tests, I truncate the `workers` table (also sketched below)
- The AWS security group associated with every involved EC2 instance grants connectivity over all ports and protocols, and the subnet CIDRs to which the instances belong are listed as sources
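Rough sketches of two of the items above. The port and host names are the example values from this report, and the database name and user in the last command are hypothetical; they depend on how the ATC's Postgres was configured.

```sh
# --- connectivity check between the ATC hosts ---
# On atc2: listen on one of the ephemeral ports from the error messages
# (traditional netcat wants `nc -l -p 33366` instead).
nc -l 33366

# On atc1: confirm a TCP connection can be opened to that port.
telnet concourse-atc2 33366

# --- resetting worker state between test runs ---
# Hypothetical psql invocation; adjust database/user to match however
# the ATC's Postgres was set up.
psql -U concourse -d atc -c 'TRUNCATE TABLE workers;'
```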
Relevant Configuration
# Concourse web #1:
--peer-url http://concourse-atc2:8080
--external-url https://<public-host-name-of-proxy>
--tsa-bind-ip 0.0.0.0
# Concourse web #2:
--peer-url http://concourse-atc1:8080
--external-url https://<public-host-name-of-proxy>
--tsa-bind-ip 0.0.0.0
# Worker:
--tsa-host <internal-host-name-of-proxy>
--tsa-port 2224
--bind-ip 0.0.0.0
--garden-bind-ip 0.0.0.0
--baggageclaim-bind-ip 0.0.0.0
# Proxy config for TSAs:
frontend concourse-tsa *:2224
mode tcp
option tcplog
default_backend concourse-tsa-pool
backend concourse-tsa-pool
mode tcp
option tcplog
balance roundrobin
server concourse-atc1 concourse-atc1:2222 check
server concourse-atc2 concourse-atc2:2222 check
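And a rough sketch of how the proxy leg itself can be exercised from a worker host (the placeholder matches the worker's `--tsa-host` value above; since worker registration happens over SSH, a healthy TSA behind the proxy should answer with an SSH protocol banner):

```sh
# Confirm the proxy accepts TCP on the mapped port...
nc -zv <internal-host-name-of-proxy> 2224

# ...and that one of the TSAs behind it answers with an SSH banner
# (something like "SSH-2.0-..."), proving the TCP path end to end.
echo | nc <internal-host-name-of-proxy> 2224
```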
Stuff
- Concourse version: Tried 3.8.0, 3.9.1
- Deployment type (BOSH/Docker/binary): binary
- Infrastructure/IaaS: AWS
- Browser (if applicable): Chromium on Linux
- Did this used to work? Not for me
Summary
So, given all of this, I'm led to believe that I've provided adequate connectivity between these Concourse parts, such that the ATCs/TSAs should be able to talk to each other and allow successful worker registration/forwarding. My wild-guess theory is that I've somehow misconfigured the ATCs/TSAs, so that one ATC/TSA is not listening on the ephemeral port used for querying/manipulating containers and volumes when the other tries to connect. Or I've simply tripped over a number of id10t errors.
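If it helps, this is roughly how that theory could be checked. The ports here are just the example values from the logs above; the real ephemeral ports would need to be read from whatever error the ATC prints at the time.

```sh
# Run on each ATC/TSA host while a worker is attempting to register.
# List listening TCP sockets and their owning processes, looking for the
# ephemeral ports that the other ATC is being refused on.
sudo ss -ltnp | grep -E ':(38775|33366)'
```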
I'd appreciate some guidance. What might I have done wrong given these clues?