Skip to content

Conversation

jrfastab
Copy link
Contributor

@jrfastab jrfastab commented Mar 4, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 15048; do contrib/backporting/set-labels.py $pr done 1.8; done

@jrfastab jrfastab requested a review from a team as a code owner March 4, 2021 01:25
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.8 kind/backports This PR provides functionality previously merged into master. labels Mar 4, 2021
@jrfastab jrfastab added backport/1.8 kind/backports This PR provides functionality previously merged into master. labels Mar 4, 2021
@jrfastab
Copy link
Contributor Author

jrfastab commented Mar 4, 2021

test-backport-1.8

@aanm
Copy link
Member

aanm commented Mar 4, 2021

test-upstream-k8s

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed 1st commit

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diff looks good compared to upstream commits, CI is all passing. Just needs a minor rebase.

jrfastab added 9 commits March 5, 2021 12:59
[ upstream commit 49fe92a ]

For encryption use cases we need the optional parameter in Xfrm
template. While we wait for those patches to land upstream point
at cilium/netlink with included patches.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit d02b6b7 ]

So far we've programmed FWD policy rules to mirror the IN policy rules,
but this requires the outer IPSec tunnel dstIP and inner dstIP cidr to
be non-overlapping. Also we've been setting the template IPs the same
as the matching src/dst IPs. We are about to fix the ENI case where these
assumptions are no longer true.

In order to support this allow pushing unique FWD/IN policies and
also specify matching IPs and template IPs.

Note: It probably makes sense to do a heavier refactor here and rip
out UpsertIPsecEndpoint(), at this point it just adds confusion in
my opinion. But, lets avoid that at the moment for backports.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 53650a8 ]

The XFRM stack will check for fwd policies after a decrypt or encrypt
operation is done. We don't really need these for policy reasons
because Cilium BPF side is enforcing policy. We can't just delete
the rules though. If we did the decrypted traffic, which will still
have the xfrm context encoded in the skb, will fail policy match and
generate a NOPOLICY error. But as is in ENI modes the template lookup
is failing with an TMPL_MISMATCH error.

To fix make the fwd rules less strict by adding the optional bit
to the rule. This will skip the template lookup altogether on the
kernel side.

Although this is needed for ENI cases where the packet is handled
by the routing stack we can enable it in all cases. Making other
cases less strict is also OK.

Before this we had fwd `ip x p` like this,

root@ip-192-168-90-141:/home/cilium# ip x p
src 0.0.0.0/0 dst 192.168.0.0/16
	dir fwd priority 0 ptype main
	tmpl src 0.0.0.0 dst 192.168.0.0
		proto esp reqid 1 mode tunnel

Now we have a 'level use' included indicating the policy is not strict,

root@ip-192-168-90-141:/home/cilium# ip x p
src 0.0.0.0/0 dst 192.168.0.0/16
	dir fwd priority 0 ptype main
	tmpl src 0.0.0.0 dst 192.168.0.0
		proto esp reqid 1 mode tunnel
		level use

This corresponds to kernel code,

 static inline int
 xfrm_policy_ok(const struct xfrm_tmpl *tmpl, const struct sec_path *sp, int start,
	       unsigned short family)
 {
	int idx = start;

	if (tmpl->optional) {
		if (tmpl->mode == XFRM_MODE_TRANSPORT)
			return start;
	} else
		start = -1;
	for (; idx < sp->len; idx++) {
		if (xfrm_state_ok(tmpl, sp->xvec[idx], family))
			return ++idx;
		if (sp->xvec[idx]->props.mode != XFRM_MODE_TRANSPORT) {
			if (start == -1)
				start = -2-idx;
			break;
		}
	}
	return start;
 }

And with 'level use' we hit the first 'if (tmpl->optional)' so that the
subsequent xfrm_state_ok checks are skipped.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 86d78df ]

In ENI mode we use link IP for outer IPsec headers. In order for rules
to be configured with correct IPs do a link lookup to discover the
correct IP.

Without this patch we will be missing the 'in' policy rules shown here,

 root@ip-192-168-90-141:/home/cilium# ip x p
 src 0.0.0.0/0 dst 192.168.0.0/16
	dir fwd priority 0 ptype main
	tmpl src 0.0.0.0 dst 192.168.90.141
		proto esp reqid 1 mode tunnel
		level use
 src 0.0.0.0/0 dst 192.168.64.0/19
 	dir in priority 0 ptype main
	mark 0xd00/0xf00
	tmpl src 0.0.0.0 dst 192.168.90.141

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 287f49c ]

If we have endpoint routes enabled we want to use the routing table
to find the lxc* device. Trying to do this with redirects doesn't
work and may cause cilium_host to drop the packet.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit eb1ca85 ]

TLDR: Remove 0xd00 rule/routes and let xfrm stack clear mark, but restrict
      overlapping pod/network CIDRs to 4.19 and above kernels.

Because ENI mode uses an outer IP that overlaps with the inner pod IPs we
need to be extra careful with the routing/rule configurations. For example
we currently put a route in the decrypt routing table that looks like
this,

	192.168.0.0/16 dev cilium_host mtu 9001

where 192.168.0.0/16 is the subnet IP in this example. The purpose of this
rule is twofold. First in the encryption direction it should match the
encrypted packet and send it to cilium_host for necessary rewrites. Second
on decryption there should be a simliar rule to match and send traffic to
the sending interface. Something like this,

	$POD_CIDR dev eth0 mtu

But, in the ENI case the decrypted CIDR and the encrypted CIDR are the
same so they collide and we get a single rule. Worse now received ESP
packets will match above and be routed to cilium_host before they can
be decrypted. cilium_host will become confused and send them to the
stack. At this point the mark set by bpf_network will be dropped and
the stack will miss the xfrm state lookup.

  eth0 (rx esp) -> bpf_packet (mark for decryption mark=0xd00)
                     -> rule finds 0x0d00 uses routing table 200
		     -> route hits {$CIDR dev cilium_host}
		     -> ESP packet on cilium host runs bpf_host
		     -> bpf_host return CTX_ACT_OK
		     -> cilium_host fwd to cilium_net mark = 0 now
		     -> ESP packet (dev=cilium_net) lookup xfrm state
		     -> XFRM_NOSTATE error because mark cleared

So everything went sideways after the 0xd00 routing table hit. The reason
we even have a 0xd00 routing rule is to clear the mark bits after decryption,
but before pushing to stack. Expected flow,

   eth0 (rx esp) -> bpf_packet (mark for decryption mark=0xd00)
                     -> rule finds 0x0d00 uses routing table 200
		     -> no match send to stack with mark=0xd00 dev=eth0
		     -> xfrm state lookup matches
		     -> decrypted packet mark=0xd00
                        rule for 0x0d00 uses tbl 200
		     -> route decrypted packet to eth0
 		     -> bpf_packet runs clears mark and sends to stack

At this point a decrypted packet is sent to stack just as if it had arrived
decrypted. We could remove the steps after the xfrm decrypt if the xfrm stack
cleared the mark with an output_mark action in xfrm. We haven't done this yet
because it doesn't work pre-4.19 kernels. For ENI + overlapping subnets we
will put the 4.19 kernel requirement and remove the 0xd00 rule completely.
This simplifies datapath to just,

  eth0 (rx esp) -> bpf_packet mark -> decrypt -> stack ...

Much cleaner, but only usable where the output mark can be set. So we
keep the fallback path to use the mark if the mark action is not
available.

Before this patch two ip rules will be set for encrypt and decrypt,

 $ ip rule
   1:      from all fwmark 0xe00/0xf00 lookup 200
   1:      from all fwmark 0xd00/0xf00 lookup 200
   ...

After this patch only the encrypt rule will be set,

 $ ip rule
   1:      from all fwmark 0xe00/0xf00 lookup 200
   ...

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 0314553 ]

Make the fwd policy rule more permissive and remove the mark. This
way it will match in both cases: the mark is set because we are
running without subnet set, the case where the mark is 0 in the
ENI case.

Before this patch an example fwd policy was,

src 0.0.0.0/32 dst 192.168.32.0/19
        dir fwd priority 0 ptype main
        mark 0xd00/0xf00
        tmpl src 0.0.0.0 dst 192.168.62.241
                proto esp reqid 1 mode tunnel
                level use

Notice the 'mark 0xd00/0xf00' case restricting the match. After thiis
patch we drop the mark, which will make the match more permissive.

src 0.0.0.0/32 dst 192.168.32.0/19
        dir fwd priority 0 ptype main
        tmpl src 0.0.0.0 dst 192.168.62.241
                proto esp reqid 1 mode tunnel
                level use

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit bc8bb3c ]

In endpoint route modes we have routes in the system to the endpoints so we
can skip the route back to the interface with bpf_network.o set, previously
used to do the redirect needed when the stack can not route.

This will work back to 4.14 kernels where XFRM_OUTPUT_MARK was added. If we
need to go back further an alternative route/mark/rule will need to be
added.

Before this patch we had in state rules like this,

src 0.0.0.0 dst 192.168.62.241
        proto esp spi 0x00000003 reqid 1 mode tunnel
        replay-window 0
        mark 0xd00/0xf00 output-mark 0xd00/0xf00

The above will use the 'output-mark' field to set the skb->mark after
decryption to 0xd00. But, after patch (and with endpoint routes enables)
we set the output-mark to 0 with the following state rule.

src 0.0.0.0 dst 192.168.62.241
        proto esp spi 0x00000003 reqid 1 mode tunnel
        replay-window 0
        mark 0xd00/0xf00 output-mark 0x0/0xf00

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
[ upstream commit 428b0b0 ]

We currently ignore the user provided interface in ENI modes. Lets not
overwrite user input if they specify it.

Fixes: 2c5e192 ("encryption/loader: Attach bpf_network program to correct interface for ENI mode")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
@jrfastab jrfastab force-pushed the pr/v1.8-backport-2021-03-03 branch from ea3fae2 to bc555f9 Compare March 5, 2021 21:05
@jrfastab
Copy link
Contributor Author

jrfastab commented Mar 5, 2021

test-backport-1.8

@joestringer
Copy link
Member

Travis job is actually green behind the scenes, it's just that the update to the jobs summary on the github page didn't get propagated properly somehow. https://travis-ci.com/github/cilium/cilium/jobs/488669666

@joestringer joestringer merged commit 324763c into v1.8 Mar 6, 2021
@joestringer joestringer deleted the pr/v1.8-backport-2021-03-03 branch March 6, 2021 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants