Skip to content

Conversation

nathanjsweet
Copy link
Member

@nathanjsweet nathanjsweet commented May 17, 2023

  • Add Tests for Deny Precedence Bug:

    Currently, when a broad "Deny" policy is paired with a specific "Unmanaged CIDR" policy, then the "Unmanaged CIDR" policy will still be inserted into the policy map for an endpoint. This results in "Deny" policies not always taking precedence over "Allow" policies. This test confirms the bugs existence.

  • Fix Deny Precedence Bug:

    When the policy map state is created CIDRs are now checked against one another to ensure that deny-rules that supersede allow-rules when they should.

    DenyPreferredInsert has been refactored to use utility methods that make the complex boolean logic of policy precedence more atomic.

    Add NetsContainsAny method to pkg/ip to compare cases where one set of networks conatins or is equal to any network in another set.

  • endpoint: Add policy.Identity Implementation

    A policy.Identity implementation is necessary for the incremental update to the endpoint's policy map that can occur with L7 changes. Valid deny-policy entries may prohibit these L7 changes based on CIDR rules, which are only obtainable by looking up all potentially conflicting policies' labels. Thus l4.ToMapState needs access to the identity allocater to lookup "random" identity labels.

  • Additional commit from @joestringer to fix concurrent access to the SelectorCache, a bug the original work introduced.

  • Additional commit from @nathanjsweet to prevent a potential nil pointer dereference in the endpoint code.

policy: Promote Deny Policies from Beta to Stable

@nathanjsweet nathanjsweet added release-note/major This PR introduces major new functionality to Cilium. kind/backports This PR provides functionality previously merged into master. backport/1.11 labels May 17, 2023
@nathanjsweet nathanjsweet requested a review from a team as a code owner May 17, 2023 02:28
@nathanjsweet
Copy link
Member Author

We'll need to upgrade golang (currently 1.17) for the v1.11 branch as we need the net/netip package which was released in go 1.18.

@nathanjsweet
Copy link
Member Author

depends on #25513.

@nathanjsweet nathanjsweet added the dont-merge/blocked Another PR must be merged before this one. label May 17, 2023
@christarazi
Copy link
Member

Let's take a step back. Couldn't we replace the netip.* types with net.* equivalents and avoid the Go upgrade?

@nathanjsweet
Copy link
Member Author

nathanjsweet commented May 17, 2023

Let's take a step back. Couldn't we replace the netip.* types with net.* equivalents and avoid the Go upgrade?

Yeah, I thought its usage was more pervasive in this PR, but it's completely unnecessary.

@nathanjsweet nathanjsweet force-pushed the pr/nathanjsweet/backport-deny-cidr-fix-to-1-11 branch from a7ce13e to 0e8ea99 Compare May 17, 2023 18:32
@nathanjsweet nathanjsweet removed the dont-merge/blocked Another PR must be merged before this one. label May 17, 2023
@christarazi
Copy link
Member

👍 Please close the corresponding backport PR to bump Go so that the backporter / triager doesn't spend time on it.

@nathanjsweet nathanjsweet force-pushed the pr/nathanjsweet/backport-deny-cidr-fix-to-1-11 branch 2 times, most recently from fe0b99b to 12c4609 Compare May 17, 2023 21:24
@nathanjsweet
Copy link
Member Author

Sorry for the spam. This backport is turning out to be trickier than I anticipated.

nathanjsweet and others added 3 commits May 17, 2023 16:26
[ upstream commit c9f0def ]

- Add Tests for Deny Precedence Bug:

  Currently, when a broad "Deny" policy is paired with
  a specific "Unmanaged CIDR" policy, then the "Unmanaged
  CIDR" policy will still be inserted into the policy map
  for an endpoint. This results in "Deny" policies not
  always taking precedence over "Allow" policies. This
  test confirms the bugs existence.

- Fix Deny Precedence Bug:

  When the policy map state is created CIDRs
  are now checked against one another to ensure
  that deny-rules that supersede allow-rules
  when they should.

  `DenyPreferredInsert` has been refactored to
  use utility methods that make the complex boolean
  logic of policy precedence more atomic.

  Add `NetsContainsAny` method to `pkg/ip` to
  compare cases where one set of networks conatins
  or is equal to any network in another set.

- endpoint: Add policy.Identity Implementation

  A `policy.Identity` implementation is necessary
  for the incremental update to the endpoint's policy
  map that can occur with L7 changes. Valid deny-policy
  entries may prohibit these L7 changes based on CIDR
  rules, which are only obtainable by looking up all potentially
  conflicting policies' labels. Thus `l4.ToMapState` needs
  access to the identity allocater to lookup "random"
  identity labels.

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
[ upstream commit 52ace8e ]

Marco Iorio reports that with previous code, Cilium could crash at
runtime after importing a network policy, with the following error
printed to the logs:

    fatal error: concurrent map read and map write

The path for this issue is printed also in the logs, with the following
call stack:

    pkg/policy.(*SelectorCache).GetLabels(...)
    pkg/policy.(*MapStateEntry).getNets(...)
    pkg/policy.entryIdentityIsSupersetOf(...)
    pkg/policy.MapState.denyPreferredInsertWithChanges(...)
    pkg/policy.MapState.DenyPreferredInsert(...)
    pkg/policy.(*EndpointPolicy).computeDirectionL4PolicyMapEntries(...)
    pkg/policy.(*EndpointPolicy).computeDesiredL4PolicyMapEntries(...)
    pkg/policy.(*selectorPolicy).DistillPolicy(...)
    pkg/policy.(*cachedSelectorPolicy).Consume(...)
    pkg/endpoint.(*Endpoint).regeneratePolicy(...)
    ...

Upon further inspection, this call path is not grabbing the
SelectorCache lock at any point. If we check all of the incoming calls
to this function, we can see multiple higher level functions calling
into this function. The following tree starts from the deepest level of
the call stack and increasing indentation represents one level higher in
the call stack.

INCOMING CALLS
- f GetLabels github.com/cilium/cilium/pkg/policy • selectorcache.go
  - f getNets github.com/cilium/cilium/pkg/policy • mapstate.go
    - f entryIdentityIsSupersetOf github.com/cilium/cilium/pkg/policy • mapstate.go
      - f denyPreferredInsertWithChanges github.com/cilium/cilium/pkg/policy • mapstate.go
        - f DenyPreferredInsert github.com/cilium/cilium/pkg/policy • mapstate.go
          - f computeDirectionL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
            - f computeDesiredL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
              + f DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- No SelectorCache lock
          - f DetermineAllowLocalhostIngress github.com/cilium/cilium/pkg/policy • mapstate.go
            + f DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- No SelectorCache lock
        - f consumeMapChanges github.com/cilium/cilium/pkg/policy • mapstate.go
          + f ConsumeMapChanges github.com/cilium/cilium/pkg/policy • resolve.go <--- Already locks the SelectorCache

Read the above tree as "GetLabels() is called by getNets()",
"getNets() is called by entryIdentityIsSupersetOf()", and so on.
Siblings at the same level of indent represent alternate callers of the
function that is one level of indentation less in the tree, ie
DenyPreferredInsert() and consumeMapChanges() both call
denyPreferredInsertWithChanges().

As annotated above, we see that calls through DistillPolicy() do not
grab the SelectorCache lock. Given that ConsumeMapChanges() grabs the
SelectorCache lock, we cannot introduce a new lock acquisition in any
descendent function, otherwise it would introduce a deadlock in
goroutines that follow that call path. This provides us the option to
lock at some point from the sibling of consumeMapChanges() or higher in
the call stack.

Given that the ancestors of DenyPreferredInsert() are all from
DistillPolicy(), we can amortize the cost of grabbing the SelectorCache
lock by grabbing it once for the policy distillation phase rather than
putting the lock into DenyPreferredInsert() where the SelectorCache
could be locked and unlocked for each map state entry.

Future work could investigate whether these call paths could make use of
the IdentityAllocator's cache of local identities for the GetLabels()
call rather than relying on the SelectorCache, but for now this patch
should address the immediate locking issue that triggers agent crashes.

CC: Nate Sweet <nathanjsweet@pm.me>
Fixes: c9f0def ("policy: Fix Deny Precedence Bug")
Reported-by: Marco Iorio <marco.iorio@isovalent.com>
Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
[ upstream commit bfbe5a2 ]

The policyIdentityLabelLookup wrapper for Endpoint implements
the GetLabels interface method. This is necessary for the
constructing the MapState of the policy engine. This implementation
incorrectly did not check if the identity returned by LookupIdentityByID
was nil. This fixes this bug, which heretofore has not caused any issues.

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
@nathanjsweet nathanjsweet force-pushed the pr/nathanjsweet/backport-deny-cidr-fix-to-1-11 branch from 12c4609 to 3351bc6 Compare May 17, 2023 21:27
@nathanjsweet
Copy link
Member Author

nathanjsweet commented May 18, 2023

/test-backport-1.11

Job 'Cilium-PR-K8s-1.18-kernel-4.9' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Failed to reach 192.168.56.12:80 from testclient-52m86

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.18-kernel-4.9/2725/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.18-kernel-4.9 so I can create one.

Then please upload the Jenkins artifacts to that issue.

@nathanjsweet
Copy link
Member Author

/test-1.18-4.9

@nathanjsweet nathanjsweet added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 31, 2023
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main commit looks good to me. No major diffs of concern between backport and main version.

@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 31, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 2, 2023
@julianwiedmann julianwiedmann merged commit 7dc319e into v1.11 Jun 2, 2023
@julianwiedmann julianwiedmann deleted the pr/nathanjsweet/backport-deny-cidr-fix-to-1-11 branch June 2, 2023 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/major This PR introduces major new functionality to Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants