Description
Bug report
Symptoms
In testing, upgrading from 1.7.4 to 1.8.2 interrupts Pod traffic for 1~10 seconds, or even longer.
Note that this is unrelated to the NAT and CT configuration changes described in the 1.8 upgrade guide.
Affected version
According to the code (see below), we assume that any upgrade from 1.7.x to 1.8.x will encounter this problem.
Digging inside
The analysis below uses the 1.8.2 and 1.7.4 implementations as references.
1. Agent removes cilium_policy map on start, before cilium_call_policy is correctly set up
Endpoints rely on this BPF map for tail calls.
This map was renamed from cilium_policy (1.7.4) to cilium_call_policy (1.8.2), but this commit removes the old one on startup, before the new one is correctly set up. All endpoints on this node are interrupted immediately.
As a quick test, reverting this commit could solve this interruption.
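The ordering problem can be illustrated with a deliberately simplified toy model. This is only a sketch: plain dictionaries stand in for the two maps, the class and function names (`ToyNode`, `upgrade`) are hypothetical, and real BPF map pinning and lifetime semantics are more involved than this.

```python
class ToyNode:
    """Toy stand-in for a node; dicts model the tail call maps by name."""

    def __init__(self, endpoints):
        # 1.7.4 state: every endpoint has an entry in cilium_policy.
        self.maps = {"cilium_policy": {ep: f"prog_{ep}" for ep in endpoints}}
        # Which map each endpoint's currently loaded program looks up.
        self.uses = {ep: "cilium_policy" for ep in endpoints}

    def missed(self):
        """Endpoints whose tail call lookup currently finds nothing."""
        return [ep for ep, name in self.uses.items()
                if ep not in self.maps.get(name, {})]

    def regenerate(self, ep):
        """Reload one endpoint so that it uses cilium_call_policy."""
        self.maps.setdefault("cilium_call_policy", {})[ep] = f"prog_{ep}"
        self.uses[ep] = "cilium_call_policy"


def upgrade(endpoints, remove_old_first):
    """Return (endpoints missing tail calls mid-upgrade, and after it)."""
    node = ToyNode(endpoints)
    if remove_old_first:
        # Order described in this report: old map removed on startup.
        node.maps.pop("cilium_policy")
    missed_during_window = node.missed()
    for ep in endpoints:
        node.regenerate(ep)
    if not remove_old_first:
        # Hypothetical alternative: drop the old map only after reloading.
        node.maps.pop("cilium_policy")
    return missed_during_window, node.missed()
```

In this model, `upgrade([1, 2], remove_old_first=True)` reports both endpoints as missing tail calls during the window, while keeping the old map until every endpoint is reloaded reports none. That matches the observation that reverting the commit avoids the interruption, but it does not prove that this is the full mechanism.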
2. Agent creates cilium_call_policy map before all endpoints' maps are reloaded
There are two sequential steps in 1.8.2:
- Reload BPF for the host device (e.g. bond1 of the node, or cilium_host). This creates cilium_call_policy.
- Reload BPF for all the endpoints (pods) on this node. This updates the endpoint -> entrypoint of tail call mappings in cilium_call_policy.
As step 1 happens before step 2, cilium_call_policy has no entry for an endpoint until that endpoint is reloaded, so its tail calls are missed in the meantime. This is another cause of the Pod traffic interruption.
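The window between the two steps can be sketched as a small simulation. This is only an illustration under stated assumptions: a dictionary stands in for cilium_call_policy, the endpoint IDs are made up, and `reload_sequence` is a hypothetical helper, not agent code.

```python
def reload_sequence(endpoint_ids):
    """Return a timeline of (event, endpoints still missing an entry)."""
    call_map = {}  # stand-in for cilium_call_policy, created empty in step 1
    timeline = []

    def missing():
        return [ep for ep in endpoint_ids if ep not in call_map]

    # Step 1: host device reloaded; the map exists but has no entries yet.
    timeline.append(("host device reloaded", missing()))

    # Step 2: endpoints are reloaded one by one, each adding its own entry.
    for ep in endpoint_ids:
        call_map[ep] = f"entrypoint_{ep}"
        timeline.append((f"endpoint {ep} reloaded", missing()))

    return timeline


for event, still_missing in reload_sequence([101, 102, 103]):
    print(f"{event}: missing tail call entries for {still_missing}")
```

In this model an endpoint stays in the missing list until its own reload completes, so endpoints reloaded last are affected longest. That would be consistent with the interruption varying in length, although the report does not confirm this breakdown.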
The drops are indeed of type Missed tail call, as seen either in the metrics or with cilium monitor --type=drop:
$ cilium monitor --type=drop
xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.106 EchoRequest
xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.93 EchoRequest
xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.106 EchoRequest
...
where 10.6.8.106 and 10.6.8.93 are IP addresses of Pods running on this node.
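To summarize such output per destination, a minimal sketch like the following could be used. It assumes the monitor lines have exactly the shape shown above (the format may differ between versions), and both `DROP_RE` and `count_drops` are hypothetical helpers, not part of any Cilium tooling.

```python
import re
from collections import Counter

# Matches lines shaped like the sample output above; the format is an assumption.
DROP_RE = re.compile(
    r"drop \((?P<reason>[^)]+)\).*?: (?P<src>\S+) -> (?P<dst>\S+)"
)


def count_drops(lines, reason="Missed tail call"):
    """Count drops with the given reason, keyed by destination IP."""
    counts = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match and match.group("reason") == reason:
            counts[match.group("dst")] += 1
    return counts


sample = [
    "xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.106 EchoRequest",
    "xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.93 EchoRequest",
    "xx drop (Missed tail call) flow 0x0 to endpoint 0, identity 2->0: 10.5.2.91 -> 10.6.8.106 EchoRequest",
]
print(count_drops(sample))
```

Feeding it the captured output of cilium monitor --type=drop during an upgrade would show which Pod IPs are affected and how often.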
CC @jaffcheng