Pod traffic get interrupted when upgrading from 1.7.x to 1.8.x #13015

@ArthurChiao

Bug report

Symptoms

In our tests, upgrading from 1.7.4 to 1.8.2 interrupts Pod traffic for 1 to 10 seconds, or even longer.

Note that this is unrelated to the NAT and CT configuration changes described in the 1.8 upgrade guide.

Affected version

According to the code (see below), we assume that any upgrade from 1.7.x to 1.8.x will encounter this problem.

Digging inside

The analysis below uses the 1.7.4 and 1.8.2 implementations as references.

1. Agent removes cilium_policy map on start, before cilium_call_policy is correctly set up

Endpoints rely on this BPF map for tail calls.

This map was renamed from cilium_policy (1.7.4) to cilium_call_policy (1.8.2), but this commit removes the old map on agent startup, before the new one has been set up correctly. Traffic for all endpoints on this node is interrupted immediately.

As a quick test, reverting this commit avoids this interruption.
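To make the ordering problem concrete, here is a toy model in Go. It is only a sketch: the `progArray` and `node` types and the `simulateUpgrade` function are hypothetical and not Cilium code, and real BPF program arrays behave differently in detail. It only illustrates that an endpoint has no tail call target between the removal of cilium_policy and the moment its slot in cilium_call_policy is filled.

```go
package main

import "fmt"

// progArray is a toy stand-in for a tail call map: endpoint ID -> program.
// (Hypothetical type, not taken from Cilium.)
type progArray map[uint16]string

// node models the maps present on a node, keyed by map name.
type node struct {
	maps map[string]progArray
}

// canTailCall reports whether the endpoint has a tail call target in either
// the old or the new map. Indexing a missing (nil) map simply yields no entry.
func (n *node) canTailCall(ep uint16) bool {
	for _, name := range []string{"cilium_policy", "cilium_call_policy"} {
		if _, ok := n.maps[name][ep]; ok {
			return true
		}
	}
	return false
}

// simulateUpgrade replays the order of operations described above.
func simulateUpgrade(ep uint16) (before, afterRemoval, afterReload bool) {
	// State left by 1.7.4: the endpoint is registered in cilium_policy.
	n := &node{maps: map[string]progArray{
		"cilium_policy": {ep: "policy program"},
	}}
	before = n.canTailCall(ep)

	// 1.8.2 agent start: the old map is removed, the new one is still empty.
	delete(n.maps, "cilium_policy")
	n.maps["cilium_call_policy"] = progArray{}
	afterRemoval = n.canTailCall(ep)

	// Later: the endpoint is reloaded and registers itself in the new map.
	n.maps["cilium_call_policy"][ep] = "policy program"
	afterReload = n.canTailCall(ep)
	return
}

func main() {
	before, afterRemoval, afterReload := simulateUpgrade(101)
	fmt.Println(before, afterRemoval, afterReload) // true false true
}
```

The middle `false` corresponds to the window in which traffic is dropped.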

2. Agent creates cilium_call_policy map before all endpoints' maps are reloaded

There are two sequential steps in 1.8.2:

  1. Reload BPF for the host device (e.g. bond1 of the node, or cilium_host). This creates cilium_call_policy.
  2. Reload BPF for all the endpoints (Pods) on this node. This updates the endpoint -> tail call entrypoint mappings in cilium_call_policy.

Because step 1 happens before step 2, lookups in cilium_call_policy miss for every endpoint until that endpoint has been reloaded. This is a second cause of the Pod traffic interruption.

The drops are indeed of type Missed tail call, as can be seen either in the metrics or with cilium monitor --type=drop:

$ cilium monitor --type=drop
xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.106 EchoRequest
xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.93 EchoRequest
xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.106 EchoRequest
...

where 10.6.8.106 and 10.6.8.93 are IP addresses of Pods running on this node.
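To quantify which Pods are affected, the monitor output can be tallied per destination IP. The following Go sketch is one way to do that. The `countMissedTailCalls` function is hypothetical, and the regular expression is derived only from the sample lines above, so it is an assumption about the line format rather than a documented one.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// dropLine matches lines shaped like the sample output above.
// Captures: 1 = drop reason, 2 = source IP, 3 = destination IP.
var dropLine = regexp.MustCompile(
	`drop \(([^)]+)\) flow \S+ to endpoint \d+, identity \d+->\d+: (\S+) -> (\S+)`)

// countMissedTailCalls counts "Missed tail call" drops per destination IP.
// Lines that do not match, or that have another drop reason, are ignored.
func countMissedTailCalls(output string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		m := dropLine.FindStringSubmatch(sc.Text())
		if m == nil || m[1] != "Missed tail call" {
			continue
		}
		counts[m[3]]++
	}
	return counts
}

func main() {
	sample := `xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.106 EchoRequest
xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.93 EchoRequest
xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.106 EchoRequest`
	for dst, n := range countMissedTailCalls(sample) {
		fmt.Printf("%s: %d missed tail calls\n", dst, n)
	}
}
```

Captured monitor output could be passed to the function as a string to see which local Pod IPs were hit during the upgrade.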

CC @jaffcheng

Labels

kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
needs/e2e-test: This issue is not covered by existing CI tests, but should be.
priority/high: This is considered vital to an upcoming release.
