Skip to content

bpf: Policy map precedence by LPM prefix length #35150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 8, 2024

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Oct 1, 2024

Policy map key can have nexthdr and destination port fields wildcarded. Cilium datapath typically performs two policy map lookups, one with the L3 identity mapped from the relevant IP address (destination or source IP for egress or ingress policy, respectively). Datapath chooses between the two policy entries based on two booleans in the policy entry: 'wildcarded_protocol' and 'wildcarded_dport'. The problem is that 'wildcarded_dport' is set only if the destination port is fully wildcarded, so partially wildcarded port is considered as "not wildcarded" for this logic.

Fix this for partially wildcarded ports by always choosing between L4-only and L3 policy matches by the key prefix length so that the most specific entry is always selected, selecting the entry with a specific identity if the prefix lengths are the same. That is, choose the policy with specific L3 if the prefixes are the same length, otherwise choose the policy with the longer prefix.

LPM map key already contains the prefix length, but unfortunately LPM map lookup does not return the map key. This is why we store an 8-bit version of the prefix length in the policy map entry, which is as follows:

0: both nexthdr and destination port fully wildcarded
8: nexthdr specified but port fully wildcarded
24: both nexthdr and destination port specified (no wildcarding)

Values between 8 and 24 represent the less or more wildcarded destination port.

Previously used wildcarded_protocol and wildcarded_dport flags are removed as redundant.

Before this change the precedence between the following two rules:

  1. <ID>/TCP/80/15 -- L3 specified, TCP port 80 partially wildcarded (ports 80-81)
  2. */TCP/80/16 -- L3 wildcarded, TCP port 80 not wildcarded

would have resulted in selecting the policy with <ID> specified even when it's destination port is partially wildcarded, while the L4-only rule has no wildcarding in it's destination port.

If the entries above were allow (1) and deny (3), Cilium agent masked this behavior by adding an additional deny entry in this case:

  1. <ID>/TCP/80/16 -- L3 <ID> specified, TCP port 80 not wildcarded

This masks the incorrect behavior for allow/deny case, since the non-wildcarded key (3) is given precedence by the datapath.

There is no such additional entry logic in the allow/allow case, so if the more specific L4-only policy entry has L7 policy, and the less specific L3 policy not, then matching traffic would not be sent to the L7 proxy for policy enforcement.

After this change the policy with less L4 wildcarding (the more specific L4) is given precedence, with L3 rule being selected when prefixes are of the same length. This allows for simplifying policy map computation by avoiding the need to add additional entries in cases where deny already has the longer prefix length.

This change makes no difference for keys which only have the destination port fully specified or fully wildcarded, so this only affects policies with port ranges as enabled in Cilium 1.16.

Removal of the additional deny entries enabled by this change make BenchmarkRegenerateCIDRDenyPolicyRules ~10% faster due to reduced allocations.

Fixes: #23885

Cilium datapath now gives precedence for the more specific allow rule with L7 rules when rules with port ranges are present.

@jrajahalme jrajahalme added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Oct 1, 2024
@jrajahalme jrajahalme requested a review from a team as a code owner October 1, 2024 12:41
@jrajahalme jrajahalme requested a review from gentoo-root October 1, 2024 12:41
@jrajahalme jrajahalme marked this pull request as draft October 1, 2024 12:41
@jrajahalme jrajahalme force-pushed the bpf-policy-map-precedence-fix branch from 28d4fb6 to 5c8007f Compare October 1, 2024 13:00
@jrajahalme jrajahalme added the needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch label Oct 1, 2024
@jrajahalme jrajahalme force-pushed the bpf-policy-map-precedence-fix branch from 5c8007f to 7817265 Compare October 1, 2024 13:09
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme force-pushed the bpf-policy-map-precedence-fix branch from 7817265 to 4d46d86 Compare October 1, 2024 13:18
@jrajahalme
Copy link
Member Author

/test

@gentoo-root
Copy link
Contributor

I don't feel like I fully get the idea, could you clarify please?

Before this change the precedence between the following two rules:

  1. /TCP/80/15 -- L3 specified, TCP port 80 partially wildcarded (ports 80-81)
  2. */TCP/80/16 -- L3 wildcarded, TCP port 80 not wildcarded

would have resulted in selecting the policy with specified even when it's destination port is partially wildcarded, while the L4-only rule has no wildcarding in it's destination port.

Meaning that 1 is preferred over 2? Is that not expected according to the comment in the code?

* 2. ANY/proto/port (L4-only)
* 3. id/proto/ANY   (L3-proto)

After this change the policy with less L4 wildcarding (the more specific L4) is given precedence, with L3 rule being selected when prefixes are of the same length.

I.e. 2 will be preferred over 1? What's the motivation of this change?

@jrajahalme jrajahalme force-pushed the bpf-policy-map-precedence-fix branch from 4d46d86 to 5824824 Compare October 2, 2024 15:58
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme force-pushed the bpf-policy-map-precedence-fix branch from 5824824 to 6e60cfd Compare October 3, 2024 05:50
@jrajahalme jrajahalme marked this pull request as ready for review October 3, 2024 05:51
@jrajahalme jrajahalme requested a review from a team as a code owner October 3, 2024 05:51
@jrajahalme jrajahalme requested a review from derailed October 3, 2024 05:51
@jrajahalme
Copy link
Member Author

I don't feel like I fully get the idea, could you clarify please?

Before this change the precedence between the following two rules:

  1. /TCP/80/15 -- L3 specified, TCP port 80 partially wildcarded (ports 80-81)

There is an unfortunate markdown rendering effect here, the source has "<ID>", which is rendered as an empty string. Note that the comment says "L3 specified".

  1. */TCP/80/16 -- L3 wildcarded, TCP port 80 not wildcarded

would have resulted in selecting the policy with specified even when it's destination port is partially wildcarded, while the L4-only rule has no wildcarding in it's destination port.

Meaning that 1 is preferred over 2? Is that not expected according to the comment in the code?

* 2. ANY/proto/port (L4-only)
* 3. id/proto/ANY   (L3-proto)

1 is preferred over 2 due to this:

 	 * 1. id/proto/port  (L3/L4)
 	 * 2. ANY/proto/port (L4-only)

This happens because the condition used for 1. evaluates true due to wildcarded_dport being true only if it is fully wildcarded:

   if (likely(policy && !policy->wildcard_dport)) {
		*match_type = POLICY_MATCH_L3_L4;		/* 1. id/proto/port */ 

After this change the policy with less L4 wildcarding (the more specific L4) is given precedence, with L3 rule being selected when prefixes are of the same length.

I.e. 2 will be preferred over 1? What's the motivation of this change?

Datapath only ever has to choose between two rules, one with full L3 identity match, other with L3 identity wildcarded (= 0). This change makes sure the policy with more specific proto/port is chosen out of these two. The lookup for these two cases is already doing that (among the matching L3 identity and wildcard L3 identity) due to the policy map being LPM.

@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme added the kind/bug This is a bug in the Cilium logic. label Oct 4, 2024
@jrajahalme jrajahalme requested review from nathanjsweet and removed request for derailed October 4, 2024 12:46
Copy link
Member

@nathanjsweet nathanjsweet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 8, 2024
Consider following two policy map state entries:

1. Allow <ID>/TCP/80/15 -- L3 <ID> specified, TCP port 80 partially wildcarded (ports 80-81)
2. Deny */TCP/80/16 -- L3 wildcarded, TCP port 80 not wildcarded

When the deny entry is added, Cilium adds an additional deny entry:

3. Deny <ID>/TCP/80/16 -- L3 <ID> specified, TCP port 80 not wildcarded

This makes sure that the allow rule is not given precedence. This is
currently needed due to the logic datapath uses to select between the
L3-specific and L4-only entries when both are found.

There is no such additional entry logic in the allow/allow case, so we
rely on the datapath choosing the "right" entry. Eg., if the more
specific L4-only policy entry has L7 policy, and the less specific L3
policy not, then the datapath should select the more specific entry and
redirect the traffic to the L7 proxy for policy enforcement.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Policy map key can have nexthdr and destination port fields
wildcarded. Cilium datapath typically performs two policy map lookups,
one with the L3 identity mapped from the relevant IP address (destination
or source IP for egress or ingress policy, respectively). Datapath
chooses between the two policy entries based on two booleans in the
policy entry: 'wildcarded_protocol' and 'wildcarded_dport'. The problem
is that 'wildcarded_dport' is set only if the destination port is fully
wildcarded, so partially wildcarded port is considered as "not
wildcarded" for this logic.

Fix this for partially wildcarded ports by always choosing between
L4-only and L3 policy matches by the key prefix length so that the most
specific entry is always selected, selecting the entry with a precific
identity if the prefix lengths are the same. That is, choose the policy
with specific L3 if the prefixes are the same length, otherwise choose
the policy with the longer prefix.

LPM map key already contains the prefix length, but unfortunately LPM map
lookup does not return the map key. This is why we store an 8-bit version
of the prefix length in the policy map entry, which is as follows:

0: both nexthdr and destination port fully wildcarded
8: nexthdr specified but port fully wildcarded
24: both nexthdr and destination port specified (no wildcarding)

Values between 8 and 24 represent the less or more wildcarded destination port.

Previously used wildcarded_protocol and wildcarded_dport flags are removed as redundant.

Before this change the precedence between the following two rules:

1. <ID>/TCP/80/15 -- L3 <ID> specified, TCP port 80 partially wildcarded (ports 80-81)
2. */TCP/80/16 -- L3 wildcarded, TCP port 80 not wildcarded

would have resulted in selecting the policy with <ID> specified even when
it's destination port is partially wildcarded, while the L4-only rule has
no wildcarding in it's destination port.

If the entries above were allow (1) and deny (3), Cilium agent masked
this behavior by adding an additional deny entry in this case:

3. <ID>/TCP/80/16 -- L3 <ID> specified, TCP port 80 not wildcarded

This masks the incorrect behavior for allow/deny case, since the
non-wildcarded key (3) is given precedence by the datapath.

There is no such additional entry logic in the allow/allow case, so if
the more specific L4-only policy entry has L7 policy, and the less
specific L3 policy not, then matching traffic would not be sent to the L7
proxy for policy enforcement.

After this change the policy with less L4 wildcarding (the more specific
L4) is given precedence, with L3 rule being selected when prefixes are of
the same length. This allows for simplifying policy map computation by
avoiding the need to add additional entries in cases where deny already
has the longer prefix length.

This change makes no difference for keys which only have the destination
port fully specified or fully wildcarded, so this only affects policies
with port ranges as enabled in Cilium 1.16.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme force-pushed the bpf-policy-map-precedence-fix branch from 6e60cfd to 6861644 Compare October 8, 2024 19:55
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme enabled auto-merge October 8, 2024 19:55
@jrajahalme jrajahalme added this pull request to the merge queue Oct 8, 2024
Merged via the queue into cilium:main with commit ebb4c47 Oct 8, 2024
63 checks passed
@jrajahalme jrajahalme deleted the bpf-policy-map-precedence-fix branch October 8, 2024 21:19
@giorio94 giorio94 mentioned this pull request Oct 9, 2024
5 tasks
@giorio94 giorio94 added the backport/author The backport will be carried out by the author of the PR. label Oct 9, 2024
@giorio94
Copy link
Member

giorio94 commented Oct 9, 2024

Marked backport/author due to significant differences between v1.16 and main (due to #33719 and #33313 apparently).

jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Oct 25, 2024
Move the prefix length field in the policy map value to the flags (same
byte as 'deny') to a previously unused space. This cleans up padding
space for future use.

Skip map entries with inconsistent prefix length field in value vs. key
so that the entry will be rewritten with the correct prefix length field
in the value. This fixes a potential issue in upgrade where a policy map
entry may remain with zero valued prefix length in value leading to
incorrect policy enforcement.

Fixes: cilium#35150

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Oct 25, 2024
Move the prefix length field in the policy map value to the flags (same
byte as 'deny') to a previously unused space. This cleans up padding
space for future use.

Skip map entries with inconsistent prefix length field in value vs. key
so that the entry will be rewritten with the correct prefix length field
in the value. This fixes a potential issue in upgrade where a policy map
entry may remain with zero valued prefix length in value leading to
incorrect policy enforcement.

Fixes: cilium#35150

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Oct 25, 2024
Move the prefix length field in the policy map value to the flags (same
byte as 'deny') to a previously unused space. This cleans up padding
space for future use.

Skip map entries with inconsistent prefix length field in value vs. key
so that the entry will be rewritten with the correct prefix length field
in the value. This fixes a potential issue in upgrade where a policy map
entry may remain with zero valued prefix length in value leading to
incorrect policy enforcement.

Fixes: cilium#35150

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Oct 25, 2024
Move the prefix length field in the policy map value to the flags (same
byte as 'deny') to a previously unused space. This cleans up padding
space for future use.

Skip map entries with inconsistent prefix length field in value vs. key
so that the entry will be rewritten with the correct prefix length field
in the value. This fixes a potential issue in upgrade where a policy map
entry may remain with zero valued prefix length in value leading to
incorrect policy enforcement.

Fixes: #35150

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Oct 29, 2024
[ upstream commit 60bc8fa ]

Move the prefix length field in the policy map value to the flags (same
byte as 'deny') to a previously unused space. This cleans up padding
space for future use.

Skip map entries with inconsistent prefix length field in value vs. key
so that the entry will be rewritten with the correct prefix length field
in the value. This fixes a potential issue in upgrade where a policy map
entry may remain with zero valued prefix length in value leading to
incorrect policy enforcement.

Fixes: cilium#35150

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Oct 29, 2024
github-merge-queue bot pushed a commit that referenced this pull request Oct 30, 2024
[ upstream commit 60bc8fa ]

Move the prefix length field in the policy map value to the flags (same
byte as 'deny') to a previously unused space. This cleans up padding
space for future use.

Skip map entries with inconsistent prefix length field in value vs. key
so that the entry will be rewritten with the correct prefix length field
in the value. This fixes a potential issue in upgrade where a policy map
entry may remain with zero valued prefix length in value leading to
incorrect policy enforcement.

Fixes: #35150

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. backport/author The backport will be carried out by the author of the PR. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants