Skip to content

Change default agent health check port to avoid conflicts #19830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 17, 2022

Conversation

tklauser
Copy link
Member

The default value 9876 for the agent health port introduced in commit
efffbdb ("daemon: expose HTTP endpoint on localhost for health
checks") conflicts with Istio's ControlZ port. The latter has been in
use for longer. Because the port is only exposed on localhost for use by
liveness/readiness probe, we can change the default value without
breaking users. Thus, change it to 9879 for which there doesn't seem to
be any documented use.

Fixes #19817

tklauser added 2 commits May 16, 2022 10:14
The default value 9876 for the agent health port introduced in commit
efffbdb ("daemon: expose HTTP endpoint on localhost for health
checks") conflicts with Istio's ControlZ port. The latter has been in
use for longer. Because the port is only exposed on localhost for use by
liveness/readiness probe, we can change the default value without
breaking users. Thus, change it to 9879 for which there doesn't seem to
be any documented use.

Suggested-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser added area/daemon Impacts operation of the Cilium daemon. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. needs-backport/1.9 labels May 16, 2022
@tklauser tklauser requested review from joestringer and a team May 16, 2022 08:27
@tklauser tklauser requested a review from a team as a code owner May 16, 2022 08:27
@tklauser tklauser requested a review from a team May 16, 2022 08:27
@tklauser tklauser requested a review from a team as a code owner May 16, 2022 08:27
@tklauser
Copy link
Member Author

/test

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty straightforward. I assume this does not have any upgrade impact, as this is only used on the pod spec, which is always updated together with the cilium-agent image, correct?

@tklauser
Copy link
Member Author

Looks pretty straightforward. I assume this does not have any upgrade impact, as this is only used on the pod spec, which is always updated together with the cilium-agent image, correct?

That's my understanding as well. As far as I can tell it's only used in the startupProbe, livenessProbe and readinessProbe of the cilium-agent DaemonSet:

startupProbe:
httpGet:
host: {{ .Values.ipv4.enabled | ternary "127.0.0.1" "::1" | quote }}
path: /healthz
port: {{ .Values.healthPort }}
scheme: HTTP
httpHeaders:
- name: "brief"
value: "true"
failureThreshold: {{ .Values.startupProbe.failureThreshold }}
periodSeconds: {{ .Values.startupProbe.periodSeconds }}
successThreshold: 1
{{- end }}
livenessProbe:
{{- if or .Values.keepDeprecatedProbes $defaultKeepDeprecatedProbes }}
exec:
command:
- cilium
- status
- --brief
{{- else }}
httpGet:
host: {{ .Values.ipv4.enabled | ternary "127.0.0.1" "::1" | quote }}
path: /healthz
port: {{ .Values.healthPort }}
scheme: HTTP
httpHeaders:
- name: "brief"
value: "true"
{{- end }}
{{- if semverCompare "<1.20-0" .Capabilities.KubeVersion.Version }}
# The initial delay for the liveness probe is intentionally large to
# avoid an endless kill & restart cycle if in the event that the initial
# bootstrapping takes longer than expected.
# Starting from Kubernetes 1.20, we are using startupProbe instead
# of this field.
initialDelaySeconds: 120
{{- end }}
periodSeconds: {{ .Values.livenessProbe.periodSeconds }}
successThreshold: 1
failureThreshold: {{ .Values.livenessProbe.failureThreshold }}
timeoutSeconds: 5
readinessProbe:
{{- if or .Values.keepDeprecatedProbes $defaultKeepDeprecatedProbes }}
exec:
command:
- cilium
- status
- --brief
{{- else }}
httpGet:
host: {{ .Values.ipv4.enabled | ternary "127.0.0.1" "::1" | quote }}
path: /healthz
port: {{ .Values.healthPort }}
scheme: HTTP
httpHeaders:
- name: "brief"
value: "true"
{{- end }}
{{- if semverCompare "<1.20-0" .Capabilities.KubeVersion.Version }}
initialDelaySeconds: 5
{{- end }}
periodSeconds: {{ .Values.readinessProbe.periodSeconds }}
successThreshold: 1
failureThreshold: {{ .Values.readinessProbe.failureThreshold }}
timeoutSeconds: 5

Otherwise e.g. helm charts are checkout out from master instead of the
PR's ref.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser requested review from a team as code owners May 16, 2022 09:53
@tklauser
Copy link
Member Author

L4LB test is failing (https://github.com/cilium/cilium/actions/runs/2330685480) because helm charts for installation are checked out from master instead of the PR branch and thus the wrong port is used for the startup/liveness/readiness probes in that test. I've pushed an additional commit fixing that.

@tklauser
Copy link
Member Author

/test

@tklauser
Copy link
Member Author

/ci-l4lb

@tklauser
Copy link
Member Author

L4LB test passed with DO NOT MERGE commit: https://github.com/cilium/cilium/actions/runs/2331557522

Removing that commit and re-triggering CI.

@tklauser tklauser force-pushed the pr/agent-health-default-port branch from 4ebe1e1 to c6c4fd9 Compare May 16, 2022 11:50
@tklauser
Copy link
Member Author

/test

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how the l4lb piece relates to the rest, but all of the individual changes LGTM.

@tklauser
Copy link
Member Author

I'm not sure how the l4lb piece relates to the rest, but all of the individual changes LGTM.

The change to the l4lb workflow is necessary because before it was using the helm charts from master rather than the PR branch. This lead to a mismatch in the agent health check port numbers between cilium-agent (using the new port 9879) vs. the helm charts from master still using port 9876 for the {startup,liveness,readiness}Probe.

@tklauser
Copy link
Member Author

Reviews are in and all required tests apart from L4LB XDP passed. That one was successful with the commit that's part of this PR, see #19830 (comment) and the explanation in #19830 (comment)

Marking as ready to merge.

@tklauser tklauser added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 17, 2022
@christarazi christarazi merged commit c3b2948 into master May 17, 2022
@christarazi christarazi deleted the pr/agent-health-default-port branch May 17, 2022 17:41
@jibi jibi added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels May 23, 2022
tklauser added a commit that referenced this pull request May 24, 2022
This should fix issues occuring after the change of the agent health
check port in #19830

Signed-off-by: Tobias Klauser <tobias@cilium.io>
tklauser added a commit that referenced this pull request May 24, 2022
This should fix issues occuring after the change of the agent health
check port in #19830

Signed-off-by: Tobias Klauser <tobias@cilium.io>
jibi pushed a commit that referenced this pull request May 24, 2022
This should fix issues occuring after the change of the agent health
check port in #19830

Signed-off-by: Tobias Klauser <tobias@cilium.io>
jibi pushed a commit that referenced this pull request May 24, 2022
This should fix issues occuring after the change of the agent health
check port in #19830

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Health port 9876 conflicts with Istio
7 participants