Skip to content

Conversation

snowp
Copy link
Contributor

@snowp snowp commented Jul 26, 2018

Description:
In order to allow hosts to move from one priority to another, this
compares a "new" host against all known hosts in order to determine
whether we've already health checked this host successfully.
The idea is that if we already know that a host is healthy we should
treat it as such when it's added to a different priority. Without this, we'll
fail requests coming in between the EDS update and the moved hosts
being successfully health checked.

Signed-off-by: Snow Pettersen snowp@squareup.com

Risk Level: Low, only affects the initial health check value when hosts are added to a priority
Testing: Unit test
Docs Changes: N/A
Release Notes: N/A

Fixes #3930

Snow Pettersen added 2 commits July 26, 2018 00:30
In order to allow hosts to move from one priority to another, this
compares a "new" host against all known hosts in order to determine
whether we can avoid reseting the active health check value of the host.
The idea is that if we already know that a host is healthy we should
treat it as such when it's added to a different priority.

Signed-off-by: Snow Pettersen <snowp@squareup.com>
Signed-off-by: Snow Pettersen <snowp@squareup.com>
@snowp snowp force-pushed the check-all-hosts branch from 3418e07 to 6c5d62a Compare July 26, 2018 04:48
@snowp
Copy link
Contributor Author

snowp commented Jul 26, 2018

Test failure seems to be due to a timeout in the websockets integration tests, which I doubt is related. Would it be possible to rerun the Ipv6 tests?

@dio
Copy link
Member

dio commented Jul 26, 2018

git commit --allow-empty -m 'trigger build' -s; git push probably could help 😄.

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@mattklein123 mattklein123 self-assigned this Jul 27, 2018
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this.

@@ -61,6 +61,15 @@ void EdsClusterImpl::onConfigUpdate(const ResourceVector& resources, const std::
throw EnvoyException(fmt::format("Unexpected EDS cluster (expecting {}): {}", cluster_name_,
cluster_load_assignment.cluster_name()));
}

HostVector all_current_hosts;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I agree with the intention here, I'm concerned about the perf impact of this change, given that we are already having perf issues with this code which is starting to be addressed in #3941 by @rgs1.

I think we might need to holistically step back and try to figure out the right path forward for solving this issue while also keeping/improving perf. Concretely, I wonder if we should be keeping an "all hosts" unordered_map keyed by address in parallel to the per-priority vectors? This would allow us to quickly do checks for whether a host already exists in a different priority.

I will also review #3941 today and will probably have additional thoughts after I do that so will circle back then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought about the map approach but I wasn't sure about the perf requirements of this code path so I wanted to float the naive solution first.

Thinking out loud a bit: if we kept such a map we could potentially simplify/speed up this code quite a bit by making addresses unique across priorities: instead of iterating over the current hosts we'd just check the map for an existing host with that address. This would make moving a host between priorities no longer require additional code and would eliminate the O(n) scan over existing addresses. A nice thing here is that we could reuse the Host from another priority when a host moves, saving us some heap allocations.

Alternatively if we want to maintain the current behavior of allowing duplicate addresses across priorities: we could maintain two maps std::unordered_map<AddressPtr, Host> and std::unordered_map<std::pair<AddressPtr, Integer>, Host> where one would let us look up hosts from any priority (for the "all hosts" check) and the other would let us look up hosts within the same priority. Essentially this transforms the code in my PR to two map lookups instead of two array scans per priority.

Unclear to me whether the host uniqueness only within priority invariant is intentional or just a result of the implementation, so not sure how important it is to maintain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear to me whether the host uniqueness only within priority invariant is intentional or just a result of the implementation, so not sure how important it is to maintain.

I actually don't know if this was intentional or not. I feel like it was but I'm not sure. IMO it would be great to simplify it if we can. @mpuncel @htuch @alyssawilk @zuercher do you remember the history here? I don't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have historical context on the code, but that plan sounds reasonable. I can't think of a reason you'd want multiple endpoints with the same address in the same cluster since there is proper support for load balancing weight.

@stale
Copy link

stale bot commented Aug 3, 2018

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Aug 3, 2018
Snow Pettersen added 2 commits August 3, 2018 17:04
Signed-off-by: Snow Pettersen <snowp@squareup.com>
Signed-off-by: Snow Pettersen <snowp@squareup.com>
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Aug 3, 2018
Signed-off-by: Snow Pettersen <snowp@squareup.com>
@snowp
Copy link
Contributor Author

snowp commented Aug 6, 2018

Updated this to maintain a map of address -> host instead of constructing a list on each config update. PTAL

return updated_hosts.count(host->address()->asString());
}), current_hosts.end());

// Do a single pass over the current_hosts and remove any that were added to the map in the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy/pasta?

@@ -41,7 +41,7 @@ class EdsClusterImpl : public BaseDynamicClusterImpl,
LocalityWeightsMap& locality_weights_map,
LocalityWeightsMap& new_locality_weights_map,
PriorityStateManager& priority_state_manager,
const HostVector& all_hosts);
std::unordered_map<std::string, HostSharedPtr>& updated_hosts);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new param name (updated_hosts) doesn't match the definition's

Copy link
Member

@rgs1 rgs1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htuch @mattklein123 took the liberty of doing a quick pass, since everything related to health checks/eds/host updates performance is close to my heart.

bool EdsClusterImpl::updateHostsPerLocality(
const uint32_t priority, const HostVector& new_hosts, LocalityWeightsMap& locality_weights_map,
LocalityWeightsMap& new_locality_weights_map, PriorityStateManager& priority_state_manager,
std::unordered_map<std::string, HostSharedPtr>& new_all_hosts) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_all_hosts is a bit confusing here, was this meant to be updated_hosts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah names went through a few iterations, will change

if (updateDynamicHostList(new_hosts, *current_hosts_copy, hosts_added, hosts_removed) ||

if (updateDynamicHostList(new_hosts, *current_hosts_copy, hosts_added, hosts_removed,
new_all_hosts, all_hosts_) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why pass all_hosts_ as a param given it's a member of BaseDynamicClusterImpl (so updateDynamicHostList() has access to it already)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! i'll update

@@ -40,7 +40,8 @@ class EdsClusterImpl : public BaseDynamicClusterImpl,
bool updateHostsPerLocality(const uint32_t priority, const HostVector& new_hosts,
LocalityWeightsMap& locality_weights_map,
LocalityWeightsMap& new_locality_weights_map,
PriorityStateManager& priority_state_manager);
PriorityStateManager& priority_state_manager,
std::unordered_map<std::string, HostSharedPtr>& updated_hosts);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted before, doesn't match the definition's param name.

std::unordered_set<std::string> host_addresses;

// Keep track of hosts we've seen during this pass so we can remove it from the
// list of current (i.e. existing hosts from this priority)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: most (all?) envoy comments are fully formed sentences (e.g.: finish them with a period, etc).

Snow Pettersen added 2 commits August 6, 2018 21:44
Signed-off-by: Snow Pettersen <snowp@squareup.com>
Signed-off-by: Snow Pettersen <snowp@squareup.com>
Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on.

}),
current_hosts.end());

// Do a single pass over the current_hosts and remove any that were added to the map in the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this partial comment is spurious.

const bool dont_remove_healthy_hosts =
health_checker_ != nullptr && !info()->drainConnectionsOnHostRemoval();
// If there are removed hosts, check to see if we should only delete if unhealthy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, i was supposed to remove the other comment you commented on

i = current_hosts.erase(i);
} else {
i++;
}
}
}

// Add the remaining hosts into the updated list of all hosts
for (auto& host : final_hosts) {
updated_hosts[host->address()->asString()] = host;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like every place where you put a host in final_hosts, you also add it to updated_hosts. It this loop necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you're right, i'll remove

for (auto& host : final_hosts) {
updated_hosts[host->address()->asString()] = host;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a comment on line 952 (below) that I think needs to be updated. It currently says

// During the search we moved all of the hosts from hosts_ into final_hosts so just
// move them back.

@@ -1054,15 +1065,19 @@ void StrictDnsClusterImpl::ResolveTarget::startResolve() {
parent_.info_, dns_address_, Network::Utility::getAddressWithPort(*address, port_),
lb_endpoint_.metadata(), lb_endpoint_.load_balancing_weight().value(),
locality_lb_endpoint_.locality(), lb_endpoint_.endpoint().health_check_config()));
updated_hosts[address->asString()] = new_hosts.back();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think pre-emptively putting this host in updated_hosts will prevent the max host weight check from occurring in updateDynamicHostList. Why not just let that function do this work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i agree - i suspect this was left over from a previous iteration and is no longer needed.

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@snowp
Copy link
Contributor Author

snowp commented Aug 9, 2018

This should be ready for another review

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think this is a great improvement. Some small comments.

@@ -630,6 +632,7 @@ class StrictDnsClusterImpl : public BaseDynamicClusterImpl {
uint32_t port_;
Event::TimerPtr resolve_timer_;
HostVector hosts_;
std::unordered_map<std::string, HostSharedPtr> all_hosts_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used?

if (!hosts_added.empty() || !current_hosts.empty()) {
hosts_removed = std::move(current_hosts);
current_hosts = std::move(final_hosts);
return true;
} else {
// During the search we moved all of the hosts from hosts_ into final_hosts so just
// move them back.
// During the search populated final_hosts with either from hosts_ or from all_hosts, so
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is a little hard to read and I think the comment is important. Can you tighten?

@@ -133,14 +138,17 @@ bool EdsClusterImpl::updateHostsPerLocality(const uint32_t priority, const HostV
// out of the locality scheduler, we discover their new weights. We don't currently have a shared
// object for locality weights that we can update here, we should add something like this to
// improve performance and scalability of locality weight updates.
if (updateDynamicHostList(new_hosts, *current_hosts_copy, hosts_added, hosts_removed) ||

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: del newline

@@ -503,6 +503,7 @@ class ClusterImplBase : public Cluster, protected Logger::Loggable<Logger::Id::u

protected:
PrioritySetImpl priority_set_;
std::unordered_map<std::string, Upstream::HostSharedPtr> all_hosts_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used in the base class? Should this be part of BaseDynamicClusterImpl ?


// Keep track of hosts we've seen during this pass so we can remove it from the
// list of current (i.e. existing hosts from this priority) hosts.
std::unordered_set<std::string> seen_hosts(current_hosts.size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to the ongoing PR that @dio is working on where we are trying to make this code easier to understand. I actually think this PR is a huge improvement, but can you potentially either add more comments or think of different names for the variables? "seen_hosts" and "current_hosts" is still pretty confusing. In this code I would opt for very long and verbose names and/or lots of comments since it's so hard to understand.

Signed-off-by: Snow Pettersen <snowp@squareup.com>
Copy link
Member

@rgs1 rgs1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another pass

hosts_added.push_back(host);

// If we are depending on a health checker, we initialize to unhealthy.
// If we are depending on a health checker, determine what to initialize the new host
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this comment changing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 seems more clear before

* @param hosts_added_to_priority will be populated with hosts added to the priority.
* @param hosts_removed_from_priority will be populated with hosts removed from the priority.
* @param updated_hosts is used to aggregate the new state of all hosts accross priority, and will
* be be updated with the hosts that remain in this priority after the update.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: be be

* @param current_priority_hosts the full lists of hosts for the priority to be updated. The list
* will be modified to contain the updated list of hosts.
* @param hosts_added_to_priority will be populated with hosts added to the priority.
* @param hosts_removed_from_priority will be populated with hosts removed from the priority.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't match the param name (hosts_removed)

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think this is a big improvement. LGTM other than some nits. @zuercher can you take another pass through?

hosts_added.push_back(host);

// If we are depending on a health checker, we initialize to unhealthy.
// If we are depending on a health checker, determine what to initialize the new host
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 seems more clear before

Signed-off-by: Snow Pettersen <snowp@squareup.com>
Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One comment from me

HostVector& hosts_removed_from_current_priority,
std::unordered_map<std::string, HostSharedPtr>& updated_hosts);

std::unordered_map<std::string, Upstream::HostSharedPtr> all_hosts_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like all_hosts_ is only used by updateDynamicHostList. Perhaps it should be private with an explicit method that takes updated_hosts invoked after all the priorities have been updated? My reasoning is that it makes management of all_hosts_ explicitly the responsibility of BaseDynamicClusterImpl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i like that idea

Signed-off-by: Snow Pettersen <snowp@squareup.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit da500d2 into envoyproxy:master Aug 14, 2018
snowp pushed a commit to snowp/envoy that referenced this pull request Aug 14, 2018
* origin/master:
  fix flaky RBAC integration test. (envoyproxy#4147)
  header_map: copy constructor for HeaderMapImpl. (envoyproxy#4129)
  test: moving websocket tests to using HTTP codec. (envoyproxy#4143)
  upstream: init host hc value based on hc value from other priorities (envoyproxy#3959)

Signed-off-by: Snow Pettersen <snowp@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants