Imadzuma
Contributor

Description

During the upgrade procedure, kubemarine updates the containerd configuration (and possibly the containerd version) on each node before upgrading the Kubernetes version there with the kubeadm command. Sometimes containerd has not finished starting by the time kubeadm is launched, and the procedure fails. The issue is intermittent:

2024-11-14 07:32:47,829 DEBUG 'containerd' package upgrade is not required
2024-11-14 07:32:47,837 DEBUG Uploading containerd configuration to fullha-master2 node...
2024-11-14 07:32:47,838 DEBUG Uploading containerd registries configuration to /etc/containerd/certs.d on fullha-master2 node...
2024-11-14 07:32:47,840 DEBUG Restarting Containerd on fullha-master2 node...
2024-11-14 07:32:49,014 DEBUG Patching /var/lib/kubelet/kubeadm-flags.env on fullha-master2 node...
[upgrade] Reading configuration from the cluster...
[upgrade] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:
failed to create new CRI runtime service: validate service connection: validate CRI v1 runtime API for endpoint "unix:///var/run/containerd/containerd.sock": rpc error: code = Unknown desc = server is not initialized yet
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
2024-11-14 07:32:49,975 CRITICAL FAILURE!
2024-11-14 07:32:49,976 CRITICAL TASK FAILED kubernetes
2024-11-14 07:32:49,976 CRITICAL KME0002: Remote group exception
2024-11-14 07:32:49,976 CRITICAL 10.102.0.2:
2024-11-14 07:32:49,976 CRITICAL 	Encountered a bad command exit code!
2024-11-14 07:32:49,976 CRITICAL 	
2024-11-14 07:32:49,976 CRITICAL 	Command: 'sudo kubeadm upgrade node --certificate-renewal=true --patches=/etc/kubernetes/patches && sudo kubectl uncordon fullha-master2 && sudo systemctl restart kubelet'
2024-11-14 07:32:49,976 CRITICAL 	
2024-11-14 07:32:49,976 CRITICAL 	Exit code: 1
2024-11-14 07:32:49,976 CRITICAL 	
2024-11-14 07:32:49,976 CRITICAL 	=== stdout ===
2024-11-14 07:32:49,976 CRITICAL 	already printed
2024-11-14 07:32:49,976 CRITICAL 	
An unexpected error occurred. It is failed to solve the problem automatically. Follow the instructions from the Troubleshooting Guide available to you. If it is impossible to solve the problem, provide the dump and the technical information above to the support team. You can restart the procedure from the last task with the following command: /opt/kubemarine/kubemarine_ee/__main__.py --tasks="kubernetes"

The issue appears when containerd has been restarted but is not initialized yet.

The fix provided in #694 does not help, because the ctr version check it relies on does not detect this state.

Solution

  1. The health check now runs crictl version instead of ctr version, because crictl can detect the not-yet-initialized state seen in the upgrade procedure;
  2. The health check runs only in the upgrade procedure. It is not relevant to the install or add_node procedures, and crictl is not yet present on a node during a fresh installation (it is installed in a later task).
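
The idea behind the health check can be sketched as a simple retry loop around crictl version. This is a minimal illustration, not the code from this pull request: the function names wait_for_cri and run_command, the retry count, and the delay are all hypothetical, and the sketch runs the command locally, whereas kubemarine executes commands on remote nodes.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command locally and return its exit code, discarding output."""
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The binary is missing, e.g. crictl on a fresh node.
        return 127
    return completed.returncode


def wait_for_cri(
    cmd: Sequence[str] = ("sudo", "crictl", "version"),
    retries: int = 10,      # hypothetical value
    delay: float = 1.0,     # hypothetical value, in seconds
    runner: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Retry the check command until it exits with 0 or retries run out.

    Returns True as soon as the command succeeds, False otherwise.
    """
    for attempt in range(retries):
        if runner(cmd) == 0:
            return True
        # Do not sleep after the final failed attempt.
        if attempt < retries - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Assumed usage: block the kubeadm step until the CRI endpoint answers.
    if not wait_for_cri():
        raise SystemExit("containerd did not become ready in time")
```

The runner parameter exists only so that the loop can be exercised without a real container runtime; a caller that is happy with local execution can ignore it.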

Test Cases

TestCase 1

Test Configuration:

  • Hardware:
  • OS:
  • Inventory:

Steps:

Repeat several times (the issue reproduces in roughly 1 in 10 runs):

  1. Run kubemarine install with Kubernetes version 1.30;
  2. Run kubemarine upgrade to version 1.31.

Results:

Before: Some runs can fail during the upgrade procedure because containerd is not initialized yet
After: All runs are successful

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • There are no breaking changes, or a migration patch is provided
  • Integration CI passed
  • Unit tests. If yes, list the new/changed tests with a brief description
  • There are no merge conflicts

Unit tests

Indicate new or changed unit tests and what they do, if any.

@koryaga added the bug label Nov 28, 2024
@theboringstuff merged commit c90bc68 into main Nov 28, 2024
44 checks passed