
Conversation

jaffcheng
Contributor

Please see commit msg

Fixes: #15878

Fix transient policy deny during agent restart

@jaffcheng jaffcheng requested a review from a team as a code owner August 10, 2021 07:43
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 10, 2021
@jaffcheng jaffcheng requested a review from jrajahalme August 10, 2021 07:43
@pchaigno pchaigno added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Aug 10, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 10, 2021
Member

@pchaigno pchaigno left a comment


Thanks for the fixes!

How did you find and validate this? Was it enough to restart Cilium with chatty pods and policies in place? If so, why did we miss those issues in CI?

@pchaigno
Member

test-me-please

@jaffcheng
Contributor Author

jaffcheng commented Aug 11, 2021

How did you find and validate this? Was it enough to restart Cilium with chatty pods and policies in place? If so, why did we miss those issues in CI?

Hi @pchaigno, my test setup to reproduce this is the following: enforce an ingress policy (one that allows access from clientpod) on serverpod on node1, create a chatty clientpod (running nc -zw1 $serverip $port every 5ms) on another node, then restart cilium-agent on node1.
With the above setup, the policy flush issue should be easy to reproduce. The probability of reproducing the missing-ipcache-entries issue likely depends on the number of entries in the ipcache map (over 55,000 in the above setup). I haven't read the CI code, but I guess it depends on the scale of the CI test setup and on how chatty the client is. Hope this helps.
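
For reference, a minimal Go sketch of such a chatty client, assuming a plain TCP connect probe comparable to nc -zw1; the address, port, and 5ms interval are placeholders taken from the setup above, not part of this PR:

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// Probes the server in a tight loop so that any transient policy deny during
// an agent restart shows up as a failed connection attempt with a timestamp.
func main() {
	addr := net.JoinHostPort(os.Args[1], os.Args[2]) // server IP and port
	for {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err != nil {
			fmt.Printf("%s connect failed: %v\n", time.Now().Format(time.RFC3339Nano), err)
		} else {
			conn.Close()
		}
		time.Sleep(5 * time.Millisecond)
	}
}
```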

@pchaigno
Member

pchaigno commented Aug 11, 2021

I'm guessing the CI 3.0 workflows didn't trigger because of yesterday's GitHub outage:
ci-awscni
ci-aks
ci-eks
ci-gke

Member

@christarazi christarazi left a comment


🚀

@pchaigno
Member

The only failing test is on EKS with IPsec and L7 policies; it therefore corresponds to #17139, a known broken test. Review requests are covered. I'm marking ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 11, 2021
Member

@joestringer joestringer left a comment


Nice find, thanks for the submission!

I've got some concerns about the first patch. I have proposed an alternative solution that I think should be able to achieve the goal of fixing the transient policy deny during agent restart but without deferring the policy synchronization too late.

@@ -298,10 +298,13 @@ func (e *Endpoint) restoreIdentity() error {
			return ErrNotAlive
		case <-gotInitialGlobalIdentities:
		}
	}
Member


Second patch LGTM.

I'm curious, do you have an easy way to reproduce these connectivity issues during agent restart? We have some tests like the Upgrade tests that attempt to do this, but it sounds like your testing is able to more reliably pick up on issues like this. If we can integrate such tests into the Cilium CI, we can hopefully catch these bugs in future before they get merged into the tree.

Member


I see there was already discussion on this above in the PR; I missed that initially :)

We do not need to resolve this testing question to merge the fix, but it would be a nice follow-up to prevent future regressions.

@joestringer joestringer removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 11, 2021
@jaffcheng jaffcheng force-pushed the fix-policy-deny-during-restart-upstream branch from 9a038a7 to 7d15312 Compare August 12, 2021 07:15
@jaffcheng jaffcheng requested a review from joestringer August 12, 2021 07:47
Member

@joestringer joestringer left a comment


Looks good, thanks.

@joestringer
Member

test-me-please

@rewiko

rewiko commented Aug 13, 2021

#15878 (comment) @yorg1st @chaosbox should be able to do it next week.
But ideally, the Cilium pipeline should have a regression test 😃. Thanks for the PR!

@joestringer
Member

joestringer commented Aug 13, 2021

I did some (very brief) triage of the failing tests. If we click through the "Details" links and scan through, we find:

Kind:

AKS:

EKS:

However, I would not expect this PR to have an effect on the tests. This PR adjusts the "endpoint restore" logic which is run when Cilium restarts with active workloads. As far as I know, the tests never restart Cilium and will only start pods after Cilium started and stop them before it shuts down. So I don't think they're related to the code being changed.

I have opened a thread in Slack #testing channel to see whether anyone knows anything about such failures.

I didn't yet search the GitHub issues to see whether there are similar reports; that would be another good next step: try to correlate with other reports and see whether a regression was recently introduced into the tree.

Currently, during endpoint restoration, policy maps are flushed
before they are refilled, which introduces a transient policy deny.

Instead of flushing all policy map entries while restoring/initializing
the endpoint policyMap, this patch synchronizes the in-memory realized
state with the existing BPF map entries; any remaining discrepancy between
desired and realized state is then dealt with by the subsequent e.syncPolicyMap.

Fixes: cilium#15878
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
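
To illustrate the approach: a minimal sketch, assuming simplified placeholder types and dump/update/delete callbacks rather than the actual Cilium endpoint and BPF map APIs. The realized state is seeded from whatever is already in the BPF map, and the later desired/realized sync removes stale entries and adds missing ones, so there is no window in which the map is empty.

```go
package sketch

// PolicyKey and PolicyEntry stand in for the real BPF policy map key and value types.
type PolicyKey struct {
	Identity uint32
	Port     uint16
	Proto    uint8
}

type PolicyEntry struct {
	ProxyPort uint16
}

// restoreRealizedState seeds the in-memory realized state from the entries
// already present in the BPF map instead of flushing them, so traffic that
// was allowed before the restart keeps flowing until policy is recomputed.
func restoreRealizedState(dump func() (map[PolicyKey]PolicyEntry, error)) (map[PolicyKey]PolicyEntry, error) {
	return dump()
}

// syncPolicyMap reconciles desired against realized state: stale entries are
// deleted and desired entries are (re)written, without an empty-map window.
func syncPolicyMap(desired, realized map[PolicyKey]PolicyEntry,
	update func(PolicyKey, PolicyEntry) error, remove func(PolicyKey) error) error {
	for k := range realized {
		if _, ok := desired[k]; !ok {
			if err := remove(k); err != nil {
				return err
			}
		}
	}
	for k, v := range desired {
		if err := update(k, v); err != nil {
			return err
		}
	}
	return nil
}
```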
Currently, during endpoint restoration, the ipcache map is unpinned and
recreated by Map.OpenParallel, and regenerated endpoints look up entries
in the newly created ipcache map, so endpoint regeneration should wait
for the ipcache map to be synchronized. However, the regeneration of the
host endpoint doesn't wait for the ipcache map sync, which introduces a
transient policy deny.

This patch fixes that by making all endpoints wait for the ipcache map sync.

Fixes: cilium#15878
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
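
A minimal sketch of the waiting pattern described above, assuming an illustrative ipcacheSynced channel and regenerate callback rather than the actual Cilium endpoint API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitThenRegenerate blocks endpoint regeneration until the restored ipcache
// map has been repopulated, or until the context is cancelled. In the fix,
// the host endpoint goes through the same wait as every other endpoint
// instead of regenerating against a still-empty ipcache map.
func waitThenRegenerate(ctx context.Context, ipcacheSynced <-chan struct{}, regenerate func() error) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-ipcacheSynced:
	}
	return regenerate()
}

func main() {
	synced := make(chan struct{})
	go func() {
		// Simulate the ipcache map sync completing shortly after restart.
		time.Sleep(100 * time.Millisecond)
		close(synced)
	}()
	err := waitThenRegenerate(context.Background(), synced, func() error {
		fmt.Println("regenerating endpoint")
		return nil
	})
	if err != nil {
		fmt.Println("regeneration aborted:", err)
	}
}
```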
@jaffcheng jaffcheng force-pushed the fix-policy-deny-during-restart-upstream branch from 7d15312 to 5c67642 Compare August 16, 2021 04:41
@jaffcheng
Contributor Author

Thanks for the investigation! Rebased

@pchaigno
Member

pchaigno commented Aug 16, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests LoadBalancer Connectivity to endpoint via LB

Failure Output

FAIL: Can not connect to service "http://192.168.1.146" from outside cluster (1/10)

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

@jaffcheng
Contributor Author

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests LoadBalancer Connectivity to endpoint via LB

Failure Output

FAIL: Can not connect to service "http://192.168.1.146" from outside cluster (1/10)

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

I guess this is #16399 ?

@pchaigno
Member

I guess this is #16399 ?

Yes. So the only failing test is a known flake and review requests are covered. Marking ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 17, 2021
@ti-mo ti-mo merged commit ff7bacb into cilium:master Aug 17, 2021
@jaffcheng jaffcheng deleted the fix-policy-deny-during-restart-upstream branch August 17, 2021 17:22
@rewiko

rewiko commented Aug 20, 2021

Nice 🌞, will it be backported to 1.9 or only to 1.10?

@pchaigno
Member

We'll backport to both.

@aditighag
Member

The 1.9 backport for this commit isn't straightforward; see this thread: #17328 (comment).

@rewiko

rewiko commented Sep 9, 2021

That's what @yorg1st told me when he was trying to deploy the fix on their clusters.

Up to you, but it's probably safer to backport only to 1.10.

Successfully merging this pull request may close these issues.

Drop happening during Cilium-agent restart