Skip to content

Conversation

UnwashedMeme
Copy link
Contributor

@UnwashedMeme UnwashedMeme commented Aug 14, 2020

This script has several tests for what the container runtime situation
looks like to determine how best to restart the underlying containers
(going around the kubelet) so that the new networking configuration
can take effect.

The first test looks to see if the crictl config file is configured to use
docker, but if that file doesn't exist then it fails. I believe docker
is the default if this hasn't been configured at all so if that file
doesn't exist then use docker.

Fixes #12850

node-init restartPods should use docker if /etc/crictl.yaml not found

Signed-off-by: Nathan Bird njbird@infiniteenergy.com

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
    • not sure where we would test this, open to ideas
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
    • happy to clarify if there are questions, I think I wrote it up well
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
    • happy to clarify if there are questions, I think I wrote it up well
  • Thanks for contributing!

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 14, 2020
@UnwashedMeme UnwashedMeme changed the title fix: node-init should use docker if /etc/crictl.yaml not found fix: node-init restartPods should use docker if /etc/crictl.yaml not found Aug 14, 2020
@maintainer-s-little-helper
Copy link

Commit 9fa556c8f79c4fc8e75cf2771536bdbc17699286 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 14, 2020
@coveralls
Copy link

coveralls commented Aug 14, 2020

Coverage Status

Coverage decreased (-0.02%) to 37.11% when pulling cbfb6ba6e43f995ce782854ca7895168a461cbbe on InfiniteEnergy:node-init-docker into bfd972f on cilium:master.

@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 14, 2020
@UnwashedMeme UnwashedMeme marked this pull request as ready for review August 14, 2020 18:15
@UnwashedMeme UnwashedMeme requested review from a team as code owners August 14, 2020 18:15
@UnwashedMeme UnwashedMeme requested a review from a team August 14, 2020 18:15
@sayboras sayboras added area/helm Impacts helm charts and user deployment experience release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Aug 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 16, 2020
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your github issue and PR 💯.

I got one question as below.

@@ -55,7 +55,7 @@ spec:
fi

echo "Waiting on pods to stop..."
if grep -q 'docker' /etc/crictl.yaml; then
if [ ! -x /etc/crictl.yaml ] || grep -q 'docker' /etc/crictl.yaml; then
Copy link
Member

@sayboras sayboras Aug 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why -x is used here?

With your changes, docker will be used for below scenario. However, crictl should be executed intead if I am not wrong.

root@kind-control-plane:/# ls -lrt /etc/crictl.yaml 
-rw-r--r-- 1 root root 56 Jul  8 21:00 /etc/crictl.yaml

root@kind-control-plane:/# cat /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe docker is the default if this hasn't been configured at all so if that file doesn't exist then use docker.

I just checked for EKS and GKE clusters that I have, both are having /etc/crictl.yaml file and both are using docker.

Might need others to confirm this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this page, it aligns with what you're saying @sayboras. I'd also like other to weigh in as I'm not too familiar with the node-init area.

Copy link
Contributor Author

@UnwashedMeme UnwashedMeme Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why -x is used here?

I thought this was the "exists" test; I just double checked and it is "executable" which would be problematic. I've fixed that to use -f which I think is more accurate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this series of conditions is trying to determine is "What is the container runtime?" It is currently (and I'm extending this some) using /etc/crictl.yaml as a signal, but this is probably not exactly correct. In a given random installation this config file might even be configured to be elsewhere.

My thinking is to follow the current pattern and try to improve the situation. In most cases if crictl.yaml explicitly mentions docker or there is no crictl.yaml config at all then the kubelet is probably using docker since that is the current default (accoding to the docs on --container-runtime). This change (now fixed with -f) will capture how AKS behaves (no crictl file) and I don't think will break EKS/GKE where that file does exist.

Other ideas:

  • Move this test down to the bottom of the if condition stack
  • Have the last two conditions test for the presence of the docker binary and crictl binary and use whichever one is found
  • Try to parse kubelet's commandline to look for relevant flags like
    • --container-runtime currently defaults to docker which was my thinking on "if this hasn't been configured, use docker"
  • --config, and then parse that
  • I don't really like this option. It will be much harder to implement (error prone) and still doesn't catch that the default value if no flags are specified can change, or be compiled independently to be different.

Copy link
Contributor Author

@UnwashedMeme UnwashedMeme Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just confirmed on the 3 AKS scenarios i know about (from these docs)

  • image AKS Ubuntu 16.04:
    • This is the default
    • /etc/crictl.yaml doesn't exist, testing its existence works
    • containers are restarted to get onto CNI net
    • cilium connectivity-check succeeds
  • image AKS Ubuntu 18.04
    • Built with --aks-custom-headers "CustomizedUbuntu=aks-ubuntu-1804"
    • /etc/crictl.yaml doesn't exist, testing its existence works
    • containers are restarted to get onto CNI net
    • cilium connectivity-check succeeds
  • image AKS Ubuntu 18.04 w/ containerd
    • built with --aks-custom-headers "CustomizedUbuntu=aks-ubuntu-1804,ContainerRuntime=containerd"

    • /etc/crictl.yaml is present; it sets runtime-endpoint: unix:///run/containerd/containerd.sock

    • crictl binary is not present, docker binary is. This is mentioned in the limitations section and makes probing for which runtime is in use by looking for what binaries are present problematic.

    • cilium-node-init logs show restart fails

          Restarting kubenet managed pods
          Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
          bash: line 86: crictl: command not found
          bash: line 86: crictl: command not found
          bash: line 86: crictl: command not found
          bash: line 86: crictl: command not found
          bash: line 86: crictl: command not found
          bash: line 86: crictl: command not found
          bash: line 86: crictl: command not found
          bash: line 86: crictl: command not found
          bash: line 86: crictl: command not found
          Node initialization complete
          bash: line 86: crictl: command not found
          !!! startup-script succeeded!
      
    • Other issues with pods failing to start, haven't fully debugged it yet b/c i think that's outside of the scope of this ticket. With crictl not present the restart pods technique cilium-node-init is attempting will be harder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your detailed explanation and checking. I think your changes now are definitely good enough for the scope of issue. Docker will be used in either of below cases:

  • /etc/crictl.yaml doesn't not exists
  • Or docker is specifically mentioned in /etc/crictl.yaml.

Ortherwise, critctl will be used by default.

@sayboras sayboras requested review from a team August 16, 2020 11:16
This script has several tests for what the container runtime situation
looks like to determine how best to restart the underlying containers
(going around the kubelet) so that the new networking configuration
can take effect.

The first test looks to see if the crictl config file is configured to use
docker, but if that file doesn't exist then it fails. I believe docker
is the default if this hasn't been configured at all so if that file
doesn't exist then use docker.

Fixes #12850

Signed-off-by: Nathan Bird <njbird@infiniteenergy.com>
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sayboras sayboras requested a review from a team August 19, 2020 07:35
Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for gotten around it!

@aanm
Copy link
Member

aanm commented Aug 20, 2020

test-me-please

Copy link
Member

@fristonio fristonio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: Thanks for a detailed explanation in the comments.

@sayboras
Copy link
Member

sayboras commented Aug 21, 2020

CI passed, mark this one as ready to merge.

PS: Add backport for 1.6, 1.7 and 1.8. I made mistake selected backport-done/1.8 😓 grrrr

@sayboras sayboras added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. backport-done/1.8 labels Aug 21, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 21, 2020
@sayboras sayboras added needs-backport/1.6 ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Aug 21, 2020
@errordeveloper errordeveloper merged commit 552c823 into cilium:master Aug 25, 2020
@UnwashedMeme UnwashedMeme deleted the node-init-docker branch August 25, 2020 17:24
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 27, 2020
This was referenced Aug 27, 2020
@aanm aanm added release-note/bug This PR fixes an issue in a previous release of Cilium. and removed release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Sep 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/helm Impacts helm charts and user deployment experience release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

helm chart nodeinit.restartPods doesn't work on azure aks-ubuntu-1804 images
10 participants