-
Notifications
You must be signed in to change notification settings - Fork 3.4k
v1.8 backports 2021-03-03 #15193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.8 backports 2021-03-03 #15193
Conversation
test-backport-1.8 |
test-upstream-k8s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reviewed 1st commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diff looks good compared to upstream commits, CI is all passing. Just needs a minor rebase.
[ upstream commit 49fe92a ] For encryption use cases we need the optional parameter in Xfrm template. While we wait for those patches to land upstream point at cilium/netlink with included patches. Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit d02b6b7 ] So far we've programmed FWD policy rules to mirror the IN policy rules, but this requires the outer IPSec tunnel dstIP and inner dstIP cidr to be non-overlapping. Also we've been setting the template IPs the same as the matching src/dst IPs. We are about to fix the ENI case where these assumptions are no longer true. In order to support this allow pushing unique FWD/IN policies and also specify matching IPs and template IPs. Note: It probably makes sense to do a heavier refactor here and rip out UpsertIPsecEndpoint(), at this point it just adds confusion in my opinion. But, lets avoid that at the moment for backports. Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 53650a8 ] The XFRM stack will check for fwd policies after a decrypt or encrypt operation is done. We don't really need these for policy reasons because Cilium BPF side is enforcing policy. We can't just delete the rules though. If we did the decrypted traffic, which will still have the xfrm context encoded in the skb, will fail policy match and generate a NOPOLICY error. But as is in ENI modes the template lookup is failing with an TMPL_MISMATCH error. To fix make the fwd rules less strict by adding the optional bit to the rule. This will skip the template lookup altogether on the kernel side. Although this is needed for ENI cases where the packet is handled by the routing stack we can enable it in all cases. Making other cases less strict is also OK. Before this we had fwd `ip x p` like this, root@ip-192-168-90-141:/home/cilium# ip x p src 0.0.0.0/0 dst 192.168.0.0/16 dir fwd priority 0 ptype main tmpl src 0.0.0.0 dst 192.168.0.0 proto esp reqid 1 mode tunnel Now we have a 'level use' included indicating the policy is not strict, root@ip-192-168-90-141:/home/cilium# ip x p src 0.0.0.0/0 dst 192.168.0.0/16 dir fwd priority 0 ptype main tmpl src 0.0.0.0 dst 192.168.0.0 proto esp reqid 1 mode tunnel level use This corresponds to kernel code, static inline int xfrm_policy_ok(const struct xfrm_tmpl *tmpl, const struct sec_path *sp, int start, unsigned short family) { int idx = start; if (tmpl->optional) { if (tmpl->mode == XFRM_MODE_TRANSPORT) return start; } else start = -1; for (; idx < sp->len; idx++) { if (xfrm_state_ok(tmpl, sp->xvec[idx], family)) return ++idx; if (sp->xvec[idx]->props.mode != XFRM_MODE_TRANSPORT) { if (start == -1) start = -2-idx; break; } } return start; } And with 'level use' we hit the first 'if (tmpl->optional)' so that the subsequent xfrm_state_ok checks are skipped. Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 86d78df ] In ENI mode we use link IP for outer IPsec headers. In order for rules to be configured with correct IPs do a link lookup to discover the correct IP. Without this patch we will be missing the 'in' policy rules shown here, root@ip-192-168-90-141:/home/cilium# ip x p src 0.0.0.0/0 dst 192.168.0.0/16 dir fwd priority 0 ptype main tmpl src 0.0.0.0 dst 192.168.90.141 proto esp reqid 1 mode tunnel level use src 0.0.0.0/0 dst 192.168.64.0/19 dir in priority 0 ptype main mark 0xd00/0xf00 tmpl src 0.0.0.0 dst 192.168.90.141 Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 287f49c ] If we have endpoint routes enabled we want to use the routing table to find the lxc* device. Trying to do this with redirects doesn't work and may cause cilium_host to drop the packet. Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit eb1ca85 ] TLDR: Remove 0xd00 rule/routes and let xfrm stack clear mark, but restrict overlapping pod/network CIDRs to 4.19 and above kernels. Because ENI mode uses an outer IP that overlaps with the inner pod IPs we need to be extra careful with the routing/rule configurations. For example we currently put a route in the decrypt routing table that looks like this, 192.168.0.0/16 dev cilium_host mtu 9001 where 192.168.0.0/16 is the subnet IP in this example. The purpose of this rule is twofold. First in the encryption direction it should match the encrypted packet and send it to cilium_host for necessary rewrites. Second on decryption there should be a simliar rule to match and send traffic to the sending interface. Something like this, $POD_CIDR dev eth0 mtu But, in the ENI case the decrypted CIDR and the encrypted CIDR are the same so they collide and we get a single rule. Worse now received ESP packets will match above and be routed to cilium_host before they can be decrypted. cilium_host will become confused and send them to the stack. At this point the mark set by bpf_network will be dropped and the stack will miss the xfrm state lookup. eth0 (rx esp) -> bpf_packet (mark for decryption mark=0xd00) -> rule finds 0x0d00 uses routing table 200 -> route hits {$CIDR dev cilium_host} -> ESP packet on cilium host runs bpf_host -> bpf_host return CTX_ACT_OK -> cilium_host fwd to cilium_net mark = 0 now -> ESP packet (dev=cilium_net) lookup xfrm state -> XFRM_NOSTATE error because mark cleared So everything went sideways after the 0xd00 routing table hit. The reason we even have a 0xd00 routing rule is to clear the mark bits after decryption, but before pushing to stack. Expected flow, eth0 (rx esp) -> bpf_packet (mark for decryption mark=0xd00) -> rule finds 0x0d00 uses routing table 200 -> no match send to stack with mark=0xd00 dev=eth0 -> xfrm state lookup matches -> decrypted packet mark=0xd00 rule for 0x0d00 uses tbl 200 -> route decrypted packet to eth0 -> bpf_packet runs clears mark and sends to stack At this point a decrypted packet is sent to stack just as if it had arrived decrypted. We could remove the steps after the xfrm decrypt if the xfrm stack cleared the mark with an output_mark action in xfrm. We haven't done this yet because it doesn't work pre-4.19 kernels. For ENI + overlapping subnets we will put the 4.19 kernel requirement and remove the 0xd00 rule completely. This simplifies datapath to just, eth0 (rx esp) -> bpf_packet mark -> decrypt -> stack ... Much cleaner, but only usable where the output mark can be set. So we keep the fallback path to use the mark if the mark action is not available. Before this patch two ip rules will be set for encrypt and decrypt, $ ip rule 1: from all fwmark 0xe00/0xf00 lookup 200 1: from all fwmark 0xd00/0xf00 lookup 200 ... After this patch only the encrypt rule will be set, $ ip rule 1: from all fwmark 0xe00/0xf00 lookup 200 ... Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 0314553 ] Make the fwd policy rule more permissive and remove the mark. This way it will match in both cases: the mark is set because we are running without subnet set, the case where the mark is 0 in the ENI case. Before this patch an example fwd policy was, src 0.0.0.0/32 dst 192.168.32.0/19 dir fwd priority 0 ptype main mark 0xd00/0xf00 tmpl src 0.0.0.0 dst 192.168.62.241 proto esp reqid 1 mode tunnel level use Notice the 'mark 0xd00/0xf00' case restricting the match. After thiis patch we drop the mark, which will make the match more permissive. src 0.0.0.0/32 dst 192.168.32.0/19 dir fwd priority 0 ptype main tmpl src 0.0.0.0 dst 192.168.62.241 proto esp reqid 1 mode tunnel level use Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit bc8bb3c ] In endpoint route modes we have routes in the system to the endpoints so we can skip the route back to the interface with bpf_network.o set, previously used to do the redirect needed when the stack can not route. This will work back to 4.14 kernels where XFRM_OUTPUT_MARK was added. If we need to go back further an alternative route/mark/rule will need to be added. Before this patch we had in state rules like this, src 0.0.0.0 dst 192.168.62.241 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0xd00/0xf00 output-mark 0xd00/0xf00 The above will use the 'output-mark' field to set the skb->mark after decryption to 0xd00. But, after patch (and with endpoint routes enables) we set the output-mark to 0 with the following state rule. src 0.0.0.0 dst 192.168.62.241 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0xd00/0xf00 output-mark 0x0/0xf00 Signed-off-by: John Fastabend <john.fastabend@gmail.com>
ea3fae2
to
bc555f9
Compare
test-backport-1.8 |
Travis job is actually green behind the scenes, it's just that the update to the jobs summary on the github page didn't get propagated properly somehow. https://travis-ci.com/github/cilium/cilium/jobs/488669666 |
Once this PR is merged, you can update the PR labels via: