Skip to content

Conversation

emcfarlane
Copy link
Contributor

@emcfarlane emcfarlane commented May 30, 2018

If c.hostList only consists of the single namenode connection this will deadlock.

@colinmarc
Copy link
Owner

Hm, how? The tests only use one namenode, so that's surprising.

@emcfarlane
Copy link
Contributor Author

Wasn't sure how to test it here. If the namenode connection fails the resolveConnection goes to the for loop retry, but doesn't try to redial the same host connection so will always return the original error.

@colinmarc
Copy link
Owner

So does it deadlock, or just report "no available namenodes"? This code isn't well tested - it makes me nervous.

@emcfarlane
Copy link
Contributor Author

Sorry yep deadlock is not right the right term, it loops returning the error.

...
2018/05/30 11:14:40 Error handling request: health check fail: [no available namenodes: dial tcp: lookup hadoop-nn on 100.64.0.10:53: no such host]
2018/05/30 11:15:10 Error handling request: health check fail: [no available namenodes: dial tcp: lookup hadoop-nn on 100.64.0.10:53: no such host]

@emcfarlane
Copy link
Contributor Author

Was introduced here: #77

@colinmarc
Copy link
Owner

A test is good from the single-namenode perspective; the problem is just that the multiple-namenode situation isn't well tested.

@itszootime thoughts on this patch? If I squint my eyes really hard it seems like it'll work, but that we might want the desired behavior to be to try the second namenode after the first, even if the backoff period has elapsed. (I think that's what the deleted code was trying to accomplish?)

@emcfarlane
Copy link
Contributor Author

Immediately after a failed connection the new behavior will switch to the next host as the backoff time hasn't elapsed. After the backoff time I would expect any connection to be retryable.

@colinmarc colinmarc merged commit 195e2c7 into colinmarc:master Jun 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants