
Conversation

jaffcheng
Contributor

Please see commit msg

Fixes: #15878

Fix transient policy deny during agent restart

@jaffcheng jaffcheng requested a review from a team as a code owner August 10, 2021 07:43
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 10, 2021
@jaffcheng jaffcheng requested a review from jrajahalme August 10, 2021 07:43
@pchaigno pchaigno added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Aug 10, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 10, 2021
Member

@pchaigno pchaigno left a comment


Thanks for the fixes!

How did you find and validate this? Was it enough to restart Cilium with chatty pods and policies in place? If so, why did we miss those issues in CI?

@pchaigno
Member

test-me-please

@jaffcheng
Contributor Author

jaffcheng commented Aug 11, 2021

How did you find and validate this? Was it enough to restart Cilium with chatty pods and policies in place? If so, why did we miss those issues in CI?

Hi @pchaigno, my test setup to reproduce this is the following: enforce an ingress policy (one that allows access from clientpod) on serverpod on node1, create a chatty clientpod (running nc -zw1 $serverip $port every 5ms) on another node, then restart cilium-agent on node1.
With the above setup, the policy flush issue should be easy to reproduce. The probability of reproducing the missing-ipcache-entries issue likely depends on the number of entries in the ipcache map (over 55,000 in the above setup). I haven't read the CI code, but I guess it depends on the scale of the CI test setup and on how chatty the client is. Hope this helps.
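
For reference, a minimal Go sketch of such a chatty client, assuming a plain TCP connect probe comparable to nc -zw1; the address, port, and 5ms interval are placeholders taken from the setup above, not part of this PR:

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// Probes the server in a tight loop so that any transient policy deny during
// an agent restart shows up as a failed connection attempt with a timestamp.
func main() {
	addr := net.JoinHostPort(os.Args[1], os.Args[2]) // server IP and port
	for {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err != nil {
			fmt.Printf("%s connect failed: %v\n", time.Now().Format(time.RFC3339Nano), err)
		} else {
			conn.Close()
		}
		time.Sleep(5 * time.Millisecond)
	}
}
```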

@pchaigno
Member

pchaigno commented Aug 11, 2021

I'm guessing the CI 3.0 workflows didn't trigger because of yesterday's GitHub outage:
ci-awscni
ci-aks
ci-eks
ci-gke

Member

@christarazi christarazi left a comment


🚀

@pchaigno
Member

The only failing test is on EKS with IPsec and L7 policies; it therefore corresponds to #17139, a known broken test. Review requests are covered. I'm marking ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 11, 2021
Member

@joestringer joestringer left a comment


Nice find, thanks for the submission!

I've got some concerns about the first patch. I have proposed an alternative solution that I think should be able to achieve the goal of fixing the transient policy deny during agent restart but without deferring the policy synchronization too late.

@@ -298,10 +298,13 @@ func (e *Endpoint) restoreIdentity() error {
			return ErrNotAlive
		case <-gotInitialGlobalIdentities:
		}
	}
Member


Second patch LGTM.

I'm curious, do you have an easy way to reproduce these connectivity issues during agent restart? We have some tests like the Upgrade tests that attempt to do this, but it sounds like your testing is able to more reliably pick up on issues like this. If we can integrate such tests into the Cilium CI, we can hopefully catch these bugs in future before they get merged into the tree.

Member


I see there was already discussion on this above in the PR; I missed that initially :)

We do not need to resolve this testing question to merge the fix, but it would be a nice follow-up to prevent future regressions.

@joestringer joestringer removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 11, 2021
@jaffcheng jaffcheng force-pushed the fix-policy-deny-during-restart-upstream branch from 9a038a7 to 7d15312 Compare August 12, 2021 07:15
@jaffcheng jaffcheng requested a review from joestringer August 12, 2021 07:47
Member

@joestringer joestringer left a comment


Looks good, thanks.

@joestringer
Member

test-me-please

@rewiko

rewiko commented Aug 13, 2021

#15878 (comment) @yorg1st @chaosbox should be able to do it next week.
But ideally, the Cilium pipeline should have a regression test 😃. Thanks for the PR!

@joestringer
Member

joestringer commented Aug 13, 2021

I did some (very brief) triage of the failing tests. If we click through the "Details" links and scan through, we find:

Kind:

AKS:

EKS:

However, I would not expect this PR to have an effect on the tests. This PR adjusts the "endpoint restore" logic which is run when Cilium restarts with active workloads. As far as I know, the tests never restart Cilium and will only start pods after Cilium started and stop them before it shuts down. So I don't think they're related to the code being changed.

I have opened a thread in Slack #testing channel to see whether anyone knows anything about such failures.

I didn't yet search the GitHub issues to see whether there are similar reports; that would be another good next step: try to correlate with other reports and see whether a regression was recently introduced into the tree.

Currently, during endpoint restoration, policy maps are flushed
before they are refilled, which introduces a transient policy deny.

Instead of flushing all policy map entries while restoring/initializing
the endpoint policyMap, this patch synchronizes the in-memory realized
state with the existing BPF map entries; any remaining discrepancy between
desired and realized state is then dealt with by the subsequent e.syncPolicyMap.

Fixes: cilium#15878
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
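
To illustrate the approach: a minimal sketch, assuming simplified placeholder types and dump/update/delete callbacks rather than the actual Cilium endpoint and BPF map APIs. The realized state is seeded from whatever is already in the BPF map, and the later desired/realized sync removes stale entries and adds missing ones, so there is no window in which the map is empty.

```go
package sketch

// PolicyKey and PolicyEntry stand in for the real BPF policy map key and value types.
type PolicyKey struct {
	Identity uint32
	Port     uint16
	Proto    uint8
}

type PolicyEntry struct {
	ProxyPort uint16
}

// restoreRealizedState seeds the in-memory realized state from the entries
// already present in the BPF map instead of flushing them, so traffic that
// was allowed before the restart keeps flowing until policy is recomputed.
func restoreRealizedState(dump func() (map[PolicyKey]PolicyEntry, error)) (map[PolicyKey]PolicyEntry, error) {
	return dump()
}

// syncPolicyMap reconciles desired against realized state: stale entries are
// deleted and desired entries are (re)written, without an empty-map window.
func syncPolicyMap(desired, realized map[PolicyKey]PolicyEntry,
	update func(PolicyKey, PolicyEntry) error, remove func(PolicyKey) error) error {
	for k := range realized {
		if _, ok := desired[k]; !ok {
			if err := remove(k); err != nil {
				return err
			}
		}
	}
	for k, v := range desired {
		if err := update(k, v); err != nil {
			return err
		}
	}
	return nil
}
```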
Currently, during endpoint restoration, the ipcache map is unpinned and
recreated by Map.OpenParallel, and regenerated endpoints look up entries
in the newly created ipcache map, so endpoint regeneration should wait
for the ipcache map to be synchronized. However, the regeneration of the
host endpoint doesn't wait for the ipcache map sync, which introduces a
transient policy deny.

This patch fixes that by making all endpoints wait for the ipcache map sync.

Fixes: cilium#15878
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
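
A minimal sketch of the waiting pattern described above, assuming an illustrative ipcacheSynced channel and regenerate callback rather than the actual Cilium endpoint API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitThenRegenerate blocks endpoint regeneration until the restored ipcache
// map has been repopulated, or until the context is cancelled. In the fix,
// the host endpoint goes through the same wait as every other endpoint
// instead of regenerating against a still-empty ipcache map.
func waitThenRegenerate(ctx context.Context, ipcacheSynced <-chan struct{}, regenerate func() error) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-ipcacheSynced:
	}
	return regenerate()
}

func main() {
	synced := make(chan struct{})
	go func() {
		// Simulate the ipcache map sync completing shortly after restart.
		time.Sleep(100 * time.Millisecond)
		close(synced)
	}()
	err := waitThenRegenerate(context.Background(), synced, func() error {
		fmt.Println("regenerating endpoint")
		return nil
	})
	if err != nil {
		fmt.Println("regeneration aborted:", err)
	}
}
```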
@jaffcheng jaffcheng force-pushed the fix-policy-deny-during-restart-upstream branch from 7d15312 to 5c67642 Compare August 16, 2021 04:41
@jaffcheng
Contributor Author

Thanks for the investigation! Rebased

@pchaigno
Member

pchaigno commented Aug 16, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests LoadBalancer Connectivity to endpoint via LB

Failure Output

FAIL: Can not connect to service "http://192.168.1.146" from outside cluster (1/10)

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

@jaffcheng
Contributor Author

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests LoadBalancer Connectivity to endpoint via LB

Failure Output

FAIL: Can not connect to service "http://192.168.1.146" from outside cluster (1/10)

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

I guess this is #16399 ?

@pchaigno
Member

I guess this is #16399 ?

Yes. So the only failing test is a known flake and review requests are covered. Marking ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 17, 2021
@ti-mo ti-mo merged commit ff7bacb into cilium:master Aug 17, 2021
@jaffcheng jaffcheng deleted the fix-policy-deny-during-restart-upstream branch August 17, 2021 17:22
@rewiko

rewiko commented Aug 20, 2021

Nice 🌞, will it be backported to 1.9 or only to 1.10?

@pchaigno
Member

We'll backport to both.

@aditighag
Member

The 1.9 backport for this commit isn't straightforward; see this thread: #17328 (comment).

@rewiko

rewiko commented Sep 9, 2021

That's what @yorg1st told me when he was trying to deploy the fix on their clusters.

Up to you, but it's probably safer to backport only to 1.10.

Successfully merging this pull request may close these issues.

Drop happening during Cilium-agent restart