
Dual ATC setup cannot register worker #2085

@troykinsella

Description


Configuration Assistance Request

Note: below I refer to "ATC" and "TSA" interchangeably, as I'm running the binaries (where the web command runs both).

When I configure two ATCs with the --peer-url option of the web command, I get the following errors whenever a worker attempts to register or forward.

From atc1 (and also forwarded to worker logs):

{"timestamp":"...","source":"tsa","message":"tsa.connection.channel.forward-worker.register.failed-to-fetch-containers","log_level":2,"data":{"error":"Get http://api/containers: dial tcp <ip-of-atc2>:38775: getsockopt: connection refused","remote":"<ip-of-proxy>:52066","session":"1.1.1.5"}}
{"timestamp":"...","source":"tsa","message":"tsa.connection.channel.forward-worker.register.failed-to-list-volumes","log_level":2,"data":{"error":"Get http://concourse-atc2:33366/volumes: dial tcp <ip-of-atc2>:33366: getsockopt: connection refused","remote":"<ip-of-proxy>:52066","session":"1.1.1.5"}}

Some Facts

  • Using haproxy to round-robin TSAs
  • The two ATCs, the proxy, and workers live in the same private network, each on their own AWS EC2 instance
  • Every *-bind-ip option available to the web and worker commands is set to 0.0.0.0. Not sure if this is necessary, but I'm just trying to open everything up to get things working.
  • The TSAs are listening on the default port: 2222
  • I am running several separate Concourse installations through one proxy instance, differentiated by TSA port number. For example, the proxy listens on port 2224 and forwards to the default 2222 on the TSA host.
  • Host names are mapped correctly according to the above error messages (which show both the host name along with the correct IP)
  • When both ATCs are brought up, peered, and no worker is attempting to register/forward, no errors occur, which I think tells me that the --peer-url values are correct, and the ATCs have access to each other over 8080
  • I can prove that atc1 has connectivity to atc2, and vice versa, using telnet to connect and nc to listen on an example ephemeral port from one of the above error messages
  • If relevant, I can prove the same connectivity between the proxy, ATCs/TSAs, and workers in all directions
  • When I remove the --peer-url option and run only one ATC/TSA, the worker is able to register/forward successfully and show up in fly workers, which tells me that proxy config and ssh keys are correct
  • When I remove the proxy from the equation entirely (again running two peered ATCs/TSAs), set --external-url to point at atc1 (not the proxy), and point the worker directly at either of the TSAs, the same errors occur (only the "remote" value differs). This suggests I've done something fundamentally incorrect: multiple ATCs/TSAs should provide redundancy so that service continues when one is unavailable, but that's not happening.
  • When I run without any proxy and only one ATC/TSA, the worker shows up in fly workers
  • In between changing any concourse command options to perform any of the above tests, I am truncating the workers table
  • The AWS security group associated with every involved EC2 instance grants connectivity over all ports and all protocols, and the subnet CIDRs to which any EC2 instance is member is listed in the source
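The telnet/nc connectivity checks above can be sketched as a small self-contained script. This is a minimal sketch using Python's standard socket module; a real run would substitute concourse-atc1/concourse-atc2 and the ephemeral port from the error message for the localhost listener used here:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (like telnet)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Listen on an ephemeral port (as `nc -l` would) and probe it,
# mirroring the check run between atc1 and atc2.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # 0 = kernel picks an ephemeral port
listener.listen(1)
port = listener.getsockname()[1]

print(can_connect("127.0.0.1", port))  # True while the listener is up
listener.close()
```

Pointing `can_connect` at the host/port pairs from the error logs (e.g. `<ip-of-atc2>:38775`) reproduces the same "connection refused" distinction the logs show.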

Relevant Configuration

# Concourse web #1:
--peer-url http://concourse-atc2:8080
--external-url https://<public-host-name-of-proxy>
--tsa-bind-ip 0.0.0.0

# Concourse web #2:
--peer-url http://concourse-atc1:8080
--external-url https://<public-host-name-of-proxy>
--tsa-bind-ip 0.0.0.0

# Worker:
--tsa-host <internal-host-name-of-proxy>
--tsa-port 2224 
--bind-ip 0.0.0.0
--garden-bind-ip 0.0.0.0
--baggageclaim-bind-ip 0.0.0.0

# Proxy config for TSAs:
frontend concourse-tsa *:2224
	mode tcp
	option tcplog
	default_backend concourse-tsa-pool
backend concourse-tsa-pool
	mode tcp
	option tcplog
	balance	roundrobin
	server concourse-atc1 concourse-atc1:2222 check
	server concourse-atc2 concourse-atc2:2222 check
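For reference, the `balance roundrobin` line above means successive TSA connections alternate between the two backends, so a reconnecting worker may land on either TSA. A minimal sketch of that selection behavior (backend names taken from the config above; this models only the rotation, not haproxy's health checks):

```python
from itertools import cycle

# The two TSA backends from the haproxy pool above.
backends = ["concourse-atc1:2222", "concourse-atc2:2222"]
rr = cycle(backends)

# Four successive worker connections alternate between the TSAs.
assignments = [next(rr) for _ in range(4)]
print(assignments)
```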

Stuff

  • Concourse version: Tried 3.8.0, 3.9.1
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: AWS
  • Browser (if applicable): Chromium on Linux
  • Did this used to work? Not for me

Summary

So, given all of this, I'm led to believe that I've provided adequate connectivity between these Concourse parts, such that the ATCs/TSAs should be able to talk to each other and allow successful worker registration/forwarding. My wild-guess theory is that I've somehow misconfigured the ATCs/TSAs such that one ATC/TSA is not listening on the ephemeral port for querying/manipulating containers and volumes when the other tries to connect. Or I've simply made a number of id10t errors.

I'd appreciate some guidance. What might I have done wrong given these clues?
