-
Notifications
You must be signed in to change notification settings - Fork 3.4k
daemon: Fix handling of policy call map on downgrades #13052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8f8100b
to
cec6196
Compare
Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io>
I manually tested this fix by creating a new Docker image, diff --git a/test/helpers/cons.go b/test/helpers/cons.go
index 4e4283e0c..66875cf6e 100644
--- a/test/helpers/cons.go
+++ b/test/helpers/cons.go
@@ -193,7 +193,7 @@ const (
// CiliumStableHelmChartVersion should be the chart version that points
// to the v1.X branch
CiliumStableHelmChartVersion = "1.7-dev"
- CiliumStableVersion = "v1.7"
+ CiliumStableVersion = "v1.7-with-downgrade-fix"
CiliumLatestHelmChartVersion = "1.7.90"
MonitorLogFileName = "monitor.log"
diff --git a/test/k8sT/Updates.go b/test/k8sT/Updates.go
index 76481a9cc..d74ca8bab 100644
--- a/test/k8sT/Updates.go
+++ b/test/k8sT/Updates.go
@@ -129,15 +129,6 @@ func InstallAndValidateCiliumUpgrades(kubectl *helpers.Kubectl, oldHelmChartVers
err error
)
- canRun, err := helpers.CanRunK8sVersion(oldImageVersion, helpers.GetCurrentK8SEnv())
- ExpectWithOffset(1, err).To(BeNil(), "Unable to get k8s constraints for %s", oldImageVersion)
- if !canRun {
- Skip(fmt.Sprintf(
- "Cilium %q is not supported in K8s %q. Skipping upgrade/downgrade tests.",
- oldImageVersion, helpers.GetCurrentK8SEnv()))
- return func() {}, func() {}
- }
-
SkipIfIntegration(helpers.CIIntegrationFlannel)
if helpers.RunsWithoutKubeProxy() {
@@ -245,15 +236,13 @@ func InstallAndValidateCiliumUpgrades(kubectl *helpers.Kubectl, oldHelmChartVers
cleanupCiliumState(filepath.Join(kubectl.BasePath(), helpers.HelmTemplate), newHelmChartVersion, "", newImageVersion, "")
By("Cleaning Cilium state (%s)", oldImageVersion)
- cleanupCiliumState("cilium/cilium", oldHelmChartVersion, "cilium", oldImageVersion, "docker.io/cilium")
+ cleanupCiliumState("cilium/cilium", oldHelmChartVersion, "cilium-dev", oldImageVersion, "docker.io/cilium")
By("Deploying Cilium %s", oldHelmChartVersion)
opts := map[string]string{
- "global.tag": oldImageVersion,
- "global.registry": "docker.io/cilium",
- "agent.image": "cilium",
- "operator.image": "operator",
+ "agent.image": "docker.io/cilium/cilium-dev:"+oldImageVersion,
+ "operator.image": "docker.io/cilium/operator:v1.7",
}
if helpers.RunsWithoutKubeProxy() || helpers.RunsOn419Kernel() {
opts["global.nodePort.device"] = privateIface I ran the test in a loop 37 times and it didn't report any missed tail calls with the two fixes applied. Without any of the two fixes, drops are reported (either on upgrade or on downgrade). |
The name of the policy call map on the bpffs was changed between 1.7 and 1.8 by commit 5d6b669. Commit 6bada02 then added code to delete the old map name on initialization of the agent. We cannot simply delete the old policy call map because it might still be used by endpoints (until they are regenerated) and the base programs (until they are reloaded). However, if we rename the map in the bpffs, it shouldn't affect BPF programs that are using it and they will then pick up the new name on reload. The reverse renaming operation is needed in 1.7 to allow for downgrades from 1.8. Fixes: 5d6b669 ("maps/policymap: Rename policy call map to clarify intent") Fixes: 6bada02 ("daemon: Remove old policy call map") Signed-off-by: Paul Chaignon <paul@cilium.io>
cec6196
to
5f4e110
Compare
Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io>
test-backport-1.7 |
Previous run: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-K8s/3484/ |
Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit e9b3844 ] Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
[ upstream commit e9b3844 ] Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
[ upstream commit e9b3844 ] Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
[ upstream commit e9b3844 ] Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
The name of the policy call map on the bpffs was changed between 1.7 and 1.8 by commit 5d6b669. Commit 6bada02 then added code to delete the old map name on initialization of the agent.
We cannot simply delete the old policy call map because it might still be used by endpoints (until they are regenerated) and the base programs (until they are reloaded). However, if we rename the map in the bpffs, it shouldn't affect BPF programs that are using it and they will then pick up the new name on reload.
The reverse renaming operation is needed in 1.7 to allow for downgrades from 1.8.
Updates: #13015
Fixes: #10626
Fixes: #10845