helm: Remove default toleration on hubble-relay deployment #13237

gandro · 2020-09-21T15:01:02Z

This removes the default toleration of the hubble-relay deployment which
allowed it to be scheduled on any node. In contrast to cilium and
cilium-operator, which are capable of running in the host network
namespace, Hubble Relay does require pod connectivity to be functional.

The existing catch-all toleration was intended to provide cluster-wide
network visibility in cases where nodes are unavailable. However, the
current toleration can cause Hubble Relay to be scheduled on these
unhealthy nodes and prevent it from running correctly, even though
untainted nodes would have been available.

Single-node clusters (such as minikube) intended to run workloads will
not have any tainted nodes and are thus are unaffected by this change.
Users who who have taints on every node in their cluster will have to
use the newly introduced hubble-relay.tolerations Helm value to
introduce custom tolerations for Hubble Relay.

Fixes: #13166

Remove the default toleration on hubble-relay deployment.
Set the Helm value `hubble-relay.tolerations[0].operator=Exists` to restore the previous behavior.

This removes the default toleration of the hubble-relay deployment which allowed it to be scheduled on any node. In contrast to cilium and cilium-operator, which are capable of running in the host network namespace, Hubble Relay does require pod connectivity to be functional. The existing catch-all toleration was intended to provide cluster-wide network visibility in cases where nodes are unavailable. However, the current toleration can cause Hubble Relay to be scheduled on these unhealthy nodes and prevent it from running correctly, even though untainted nodes would have been available. Single-node clusters (such as minikube) intended to run workloads will not have any tainted nodes and are thus are unaffected by this change. Users who who have taints on every node in their cluster will have to use the newly introduced `hubble-relay.tolerations` Helm value to introduce custom tolerations for Hubble Relay. Fixes: #13166 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

gandro · 2020-09-21T15:10:49Z

Note: In contrast to what was discussed in #13166, I opted to remove the toleration from Relay by default. While we could define preferredDuringSchedulingIgnoredDuringExecution node affinity for certain taints (such as e.g. the master node), I think the existing catch-all toleration is not helpful in almost all cases.

If there are both faulty (tainted) and healthy nodes, then we do not want Hubble Relay to be scheduled on the faulty nodes. A catch-all toleration as we had it before however can cause the pod to be scheduled on the faulty nodes instead of the healthy ones.
Cluster operators will be using taints to explicitly avoid pods being scheduled in these nodes. We should respect these taints and require cluster operators to manually set tolerations for hubble-relay.
Per node observability via Hubble is still available if Relay cannot be scheduled on any node. I don't consider Hubble Relay to be as a critical pod as Cilium or Cilium Operator.

gandro · 2020-09-21T15:11:10Z

test-me-please

qmonnet · 2020-09-22T09:38:27Z

@gandro Should this be backported to 1.8?

gandro · 2020-09-22T10:02:28Z

@gandro Should this be backported to 1.8?

It wouldn't hurt. Should be fairly straightforward to backport. Adding the label.

gandro added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/hubble area/helm Impacts helm charts and user deployment experience labels Sep 21, 2020

gandro requested review from a team as code owners September 21, 2020 15:01

gandro requested review from a team September 21, 2020 15:01

rolinh approved these changes Sep 21, 2020

View reviewed changes

christarazi approved these changes Sep 21, 2020

View reviewed changes

kaworu approved these changes Sep 22, 2020

View reviewed changes

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 22, 2020

qmonnet approved these changes Sep 22, 2020

View reviewed changes

qmonnet merged commit f0bd719 into master Sep 22, 2020

qmonnet deleted the pr/gandro/hubble-relay-tolerations branch September 22, 2020 09:35

gandro added the needs-backport/1.8 label Sep 22, 2020

vadorovsky mentioned this pull request Sep 22, 2020

v1.8 backports 2020-09-22 #13246

Merged

vadorovsky added backport-pending/1.8 and removed needs-backport/1.8 labels Sep 22, 2020

qmonnet added backport-done/1.8 and removed backport-pending/1.8 labels Sep 24, 2020

joestringer mentioned this pull request Oct 1, 2020

Prepare for release v1.8.4 #13361

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

helm: Remove default toleration on hubble-relay deployment #13237

helm: Remove default toleration on hubble-relay deployment #13237

Uh oh!

gandro commented Sep 21, 2020

Uh oh!

gandro commented Sep 21, 2020

Uh oh!

gandro commented Sep 21, 2020

Uh oh!

qmonnet commented Sep 22, 2020

Uh oh!

gandro commented Sep 22, 2020

Uh oh!

Uh oh!

helm: Remove default toleration on hubble-relay deployment #13237

helm: Remove default toleration on hubble-relay deployment #13237

Uh oh!

Conversation

gandro commented Sep 21, 2020

Uh oh!

gandro commented Sep 21, 2020

Uh oh!

gandro commented Sep 21, 2020

Uh oh!

qmonnet commented Sep 22, 2020

Uh oh!

gandro commented Sep 22, 2020

Uh oh!

Uh oh!