Skip to content

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Dec 5, 2024

Please review commit by commit, and refer to the individual descriptions for additional details.

I'm currently a bit on the fence whether any commit would need to be backported to earlier branches, although I'm leaning more towards not backporting them given that they address very specific corner case situations. The only issue which looks a bit more problematic to me is the one fixed in 6bebde0, because it could lead to accessing a nil variable, but it is my understanding that the UID is always guaranteed to be set. Anyways, please let me know your thoughts during the review.

Related: #36292 (comment)

/cc @squeed

The function always returns a nil error, hence let's just drop it.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. area/daemon Impacts operation of the Cilium daemon. release-note/bug This PR fixes an issue in a previous release of Cilium. area/agent Cilium agent related. labels Dec 5, 2024
The blamed commit introduced a fallback logic to retrieve the pod
corresponding to a given endpoint directly from k8s if the store
is out of date, and contains an entry with a mismatching UID. However,
the metadata is not recomputed based on the newly retrieved pod,
hence potentially configuring incorrect values. Let's get this fixed.

Fixes: f6606c6 ("daemon,endpoint,cni: Pass pod UID through CNI ADD")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/endpoint-metadata-resolver branch from 72c6530 to eed421a Compare December 5, 2024 11:33
@giorio94 giorio94 force-pushed the mio/endpoint-metadata-resolver branch from eed421a to 0762cc6 Compare December 5, 2024 12:57
@giorio94
Copy link
Member Author

giorio94 commented Dec 5, 2024

/test

@giorio94 giorio94 force-pushed the mio/endpoint-metadata-resolver branch from 0762cc6 to e0b34de Compare December 6, 2024 13:54
@giorio94 giorio94 added release-note/misc This PR makes changes that have no direct user impact. and removed kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Dec 6, 2024
Introduce a new unit test to cover the handleOutdatedPodInformer
function, in preparation for the subsequent changes. In particular,
this proves two problems currently affecting it:

1. The function does not propagate the error returned by the fetch
   function if the endpoint UID is not set.
2. The function does not perform any retry in case of pod UID
   mismatch, conversely to the intended behavior.

--- FAIL: TestHandleOutdatedPodInformer (0.00s)
    --- FAIL: TestHandleOutdatedPodInformer/pod_not_found_(epUID:_) (0.00s)
        endpoint_test.go:278:
                Error Trace:    .../daemon/cmd/endpoint_test.go:278
                Error:          Not equal:
                                expected: *errors.StatusError(&errors.StatusError{...
                                actual  : <nil>(<nil>)
                Test:           TestHandleOutdatedPodInformer/pod_not_found_(epUID:_)
    --- FAIL: TestHandleOutdatedPodInformer/uid_mismatch_(epUID:_uid) (0.00s)
        endpoint_test.go:288:
                Error Trace:    .../daemon/cmd/endpoint_test.go:288
                Error:          Not equal:
                                expected: 0x14
                                actual  : 0x1
                Test:           TestHandleOutdatedPodInformer/uid_mismatch_(epUID:_uid)
                Messages:       Incorrect number of retries
    --- FAIL: TestHandleOutdatedPodInformer/uid_mismatch,_then_resolved_(epUID:_uid) (0.00s)
        endpoint_test.go:278:
                Error Trace:    .../daemon/cmd/endpoint_test.go:278
                Error:          Not equal:
                                expected: <nil>(<nil>)
                                actual  : *errors.errorString(&errors.errorString{s:"pod store outdated"})
                Test:           TestHandleOutdatedPodInformer/uid_mismatch,_then_resolved_(epUID:_uid)
        endpoint_test.go:288:
                Error Trace:    .../daemon/cmd/endpoint_test.go:288
                Error:          Not equal:
                                expected: 0x6
                                actual  : 0x1
                Test:           TestHandleOutdatedPodInformer/uid_mismatch,_then_resolved_(epUID:_uid)
                Messages:       Incorrect number of retries
FAIL
FAIL    github.com/cilium/cilium/daemon/cmd     1.336s
FAIL

Related: f6606c6 ("daemon,endpoint,cni: Pass pod UID through CNI ADD")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The handleOutdatedPodInformer function currently does not propagate
fetch errors if the endpoint UID is not set. Although errors can only
occur if either the namespace or the pod are not present yet in the
store, let's get this fixed, to prevent possibly accessing uninitialized
variables.

Fixes: f6606c6 ("daemon,endpoint,cni: Pass pod UID through CNI ADD")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
According to the description of the blamed commit:

    If the UID is outdated, this means that Cilium has not yet received the
    updated pod event that corresponds with the CNI ADD. In this case,
    Cilium will attempt to query the pod store up for to 2 seconds. If the
    pod store does not have the altest pod reference after 2 seconds, Cilium
    will trigger a direct fetch from the apiserver [...]

However, the current implementation never performs any retry, because
the underlying retry logic returns immediately in case an error is
returned by the function. Let's fix this by not returning an error in
that case, as well as changing the duration period to actually match
the expected max retry time of 2 seconds.

Fixes: f6606c6 ("daemon,endpoint,cni: Pass pod UID through CNI ADD")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the update of no track rules and bandwidth policies
is handled posting an event to the event queue, which in turn
executes a callback to retrieve the corresponding pod from the
store, and in turn the annotation itself. Let's simplify this
by directly propagating the annotation(s) in the event itself,
given that all callers have access to pod object, and subsequent
annotation updates would just trigger a new event. This also
prevents a possible incorrect ending state in case the same
annotation is flipped multiple times.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Move the UID mismatch check directly to the function that retrieves the
metadata for a given endpoint, and let it return an error in that case.
This change ensures that the UID is always properly checked, and allows
simplifying the error handling processing in the callers.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
All callers either ignore this field or have direct access to the full
pod object. Hence, let's simply remove it, to prevent possible ambiguity.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/endpoint-metadata-resolver branch from e0b34de to bc5e3e3 Compare December 6, 2024 14:27
@giorio94
Copy link
Member Author

giorio94 commented Dec 6, 2024

/test

@giorio94 giorio94 requested a review from squeed December 6, 2024 17:58
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for simplifying and fixing this code

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 9, 2024
@borkmann borkmann merged commit baa1766 into cilium:main Dec 9, 2024
76 checks passed
@christarazi christarazi added affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch affects/v1.16 This issue affects v1.16 branch labels Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch affects/v1.16 This issue affects v1.16 branch area/agent Cilium agent related. area/daemon Impacts operation of the Cilium daemon. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants