Conversation

@pchaigno (Member) commented Sep 2, 2020

The name of the policy call map on the bpffs was changed between 1.7 and 1.8 by commit 5d6b669. Commit 6bada02 then added code to delete the old map name on initialization of the agent.

We cannot simply delete the old policy call map because it might still be used by endpoints (until they are regenerated) and by the base programs (until they are reloaded). However, if we rename the map in the bpffs, it shouldn't affect BPF programs that are already using it, and they will pick up the new name on reload.

The reverse renaming operation is needed in 1.7 to allow for downgrades from 1.8.

Updates: #13015
Fixes: #10626
Fixes: #10845

@pchaigno pchaigno added release-note/bug This PR fixes an issue in a previous release of Cilium. kind/backports This PR provides functionality previously merged into master. upgrade-impact This PR has potential upgrade or downgrade impact. backport/1.7 labels Sep 2, 2020
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-policycallmap-downgrade branch from 8f8100b to cec6196 Compare September 2, 2020 20:40
@pchaigno pchaigno marked this pull request as ready for review September 4, 2020 20:23
@pchaigno pchaigno requested a review from a team as a code owner September 4, 2020 20:23
pchaigno added a commit that referenced this pull request Sep 4, 2020
Connectivity disruptions caused by missed tail calls were recently
reported at #13015. They were caused by incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more missed tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
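Summing the metric across pods amounts to collecting the metrics output from each Cilium pod and adding up the matching values. A sketch of such a helper (the line format and match string are assumptions for illustration; the real test parses the output of `cilium metrics list` on each pod):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sumMissedTailCalls sums the values of metric lines mentioning missed
// tail calls across the outputs collected from each Cilium pod. The
// "<metric> <value>" line format is an assumption for this sketch.
func sumMissedTailCalls(perPodOutputs []string) (int, error) {
	total := 0
	for _, out := range perPodOutputs {
		for _, line := range strings.Split(out, "\n") {
			if !strings.Contains(line, "Missed tail calls") {
				continue
			}
			fields := strings.Fields(line)
			if len(fields) < 2 {
				continue
			}
			// The value is the last whitespace-separated field.
			v, err := strconv.Atoi(fields[len(fields)-1])
			if err != nil {
				return 0, err
			}
			total += v
		}
	}
	return total, nil
}

func main() {
	// Sample per-pod outputs; the metric name is illustrative.
	pods := []string{
		`cilium_drop_count_total{reason="Missed tail calls"} 2`,
		`cilium_drop_count_total{reason="Missed tail calls"} 0`,
	}
	n, err := sumMissedTailCalls(pods)
	if err != nil {
		panic(err)
	}
	fmt.Println("missed tail calls:", n) // prints "missed tail calls: 2"
}
```

The test then asserts that this sum is zero after both the upgrade and the subsequent downgrade.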
@pchaigno (Member, Author) commented Sep 4, 2020

I manually tested this fix by creating a new Docker image, docker.io/cilium/cilium-dev:v1.7-with-downgrade-fix (still available). I then backported #13051 (the v1.8 upgrade fix) and #13097 (the extended K8sUpdates test written to detect missed tail call drops) to v1.8. I applied the following diff to be able to test the upgrade/downgrade path with K8sUpdates:
v1.7-with-downgrade-fix -> v1.8-with-upgrade-fix -> v1.7-with-downgrade-fix.

diff --git a/test/helpers/cons.go b/test/helpers/cons.go
index 4e4283e0c..66875cf6e 100644
--- a/test/helpers/cons.go
+++ b/test/helpers/cons.go
@@ -193,7 +193,7 @@ const (
 	// CiliumStableHelmChartVersion should be the chart version that points
 	// to the v1.X branch
 	CiliumStableHelmChartVersion = "1.7-dev"
-	CiliumStableVersion          = "v1.7"
+	CiliumStableVersion          = "v1.7-with-downgrade-fix"
 	CiliumLatestHelmChartVersion = "1.7.90"
 
 	MonitorLogFileName = "monitor.log"
diff --git a/test/k8sT/Updates.go b/test/k8sT/Updates.go
index 76481a9cc..d74ca8bab 100644
--- a/test/k8sT/Updates.go
+++ b/test/k8sT/Updates.go
@@ -129,15 +129,6 @@ func InstallAndValidateCiliumUpgrades(kubectl *helpers.Kubectl, oldHelmChartVers
 		err          error
 	)
 
-	canRun, err := helpers.CanRunK8sVersion(oldImageVersion, helpers.GetCurrentK8SEnv())
-	ExpectWithOffset(1, err).To(BeNil(), "Unable to get k8s constraints for %s", oldImageVersion)
-	if !canRun {
-		Skip(fmt.Sprintf(
-			"Cilium %q is not supported in K8s %q. Skipping upgrade/downgrade tests.",
-			oldImageVersion, helpers.GetCurrentK8SEnv()))
-		return func() {}, func() {}
-	}
-
 	SkipIfIntegration(helpers.CIIntegrationFlannel)
 
 	if helpers.RunsWithoutKubeProxy() {
@@ -245,15 +236,13 @@ func InstallAndValidateCiliumUpgrades(kubectl *helpers.Kubectl, oldHelmChartVers
 		cleanupCiliumState(filepath.Join(kubectl.BasePath(), helpers.HelmTemplate), newHelmChartVersion, "", newImageVersion, "")
 
 		By("Cleaning Cilium state (%s)", oldImageVersion)
-		cleanupCiliumState("cilium/cilium", oldHelmChartVersion, "cilium", oldImageVersion, "docker.io/cilium")
+		cleanupCiliumState("cilium/cilium", oldHelmChartVersion, "cilium-dev", oldImageVersion, "docker.io/cilium")
 
 		By("Deploying Cilium %s", oldHelmChartVersion)
 
 		opts := map[string]string{
-			"global.tag":      oldImageVersion,
-			"global.registry": "docker.io/cilium",
-			"agent.image":     "cilium",
-			"operator.image":  "operator",
+			"agent.image":     "docker.io/cilium/cilium-dev:"+oldImageVersion,
+			"operator.image":  "docker.io/cilium/operator:v1.7",
 		}
 		if helpers.RunsWithoutKubeProxy() || helpers.RunsOn419Kernel() {
 			opts["global.nodePort.device"] = privateIface

I ran the test in a loop 37 times and it didn't report any missed tail calls with the two fixes applied. Without either of the two fixes, drops are reported (either on upgrade or on downgrade).

The name of the policy call map on the bpffs was changed between 1.7 and
1.8 by commit 5d6b669. Commit 6bada02 then added code to delete the old
map name on initialization of the agent.

We cannot simply delete the old policy call map because it might still
be used by endpoints (until they are regenerated) and by the base
programs (until they are reloaded). However, if we rename the map in
the bpffs, it shouldn't affect BPF programs that are already using it,
and they will pick up the new name on reload.

The reverse renaming operation is needed in 1.7 to allow for downgrades
from 1.8.

Fixes: 5d6b669 ("maps/policymap: Rename policy call map to clarify intent")
Fixes: 6bada02 ("daemon: Remove old policy call map")
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-policycallmap-downgrade branch from cec6196 to 5f4e110 Compare September 7, 2020 06:08
pchaigno added a commit that referenced this pull request Sep 7, 2020
@pchaigno (Member, Author) commented Sep 7, 2020

test-backport-1.7

@pchaigno (Member, Author) commented Sep 7, 2020

Previous run: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-K8s/3484/
Only `K8sFQDNTest Restart Cilium validate that FQDN is still working` seems to be failing.
test-missed-k8s

aanm pushed a commit that referenced this pull request Sep 7, 2020
@aanm aanm merged commit 08a4294 into v1.7 Sep 7, 2020
@aanm aanm deleted the pr/pchaigno/fix-policycallmap-downgrade branch September 7, 2020 17:56
fristonio pushed a commit that referenced this pull request Sep 9, 2020
[ upstream commit e9b3844 ]

fristonio pushed a commit that referenced this pull request Sep 9, 2020
aanm pushed a commit that referenced this pull request Sep 9, 2020
aditighag pushed a commit that referenced this pull request Sep 10, 2020