Skip to content

Conversation

giorio94
Copy link
Member

Please review commit by commit, and refer to the corresponding descriptions for additional information.

@giorio94 giorio94 added kind/cleanup This includes no functional changes. kind/performance There is a performance impact of this. release-note/misc This PR makes changes that have no direct user impact. area/kvstore Impacts the KVStore package interactions. labels Oct 27, 2023
@giorio94 giorio94 requested a review from a team as a code owner October 27, 2023 13:47
@giorio94 giorio94 requested a review from tklauser October 27, 2023 13:47
@giorio94
Copy link
Member Author

/cc @marseel

@giorio94 giorio94 force-pushed the mio/etcd-misc-improvements branch from c0d3a6c to a164f6d Compare October 27, 2023 13:57
@giorio94
Copy link
Member Author

/test

1 similar comment
@giorio94
Copy link
Member Author

/test

@giorio94
Copy link
Member Author

Last commit introduced an additional check to prevent a possible source of flakiness.

@giorio94 giorio94 force-pushed the mio/etcd-misc-improvements branch from 6ea1682 to f7a9394 Compare October 31, 2023 08:17
@giorio94
Copy link
Member Author

Rebased onto main to fix conflict

@giorio94
Copy link
Member Author

/test

@tklauser tklauser requested a review from marseel October 31, 2023 08:51
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but it's been a while since I last touched that part of the code base. I've requested a review from @marseel per #28834 (comment)

@giorio94 giorio94 force-pushed the mio/etcd-misc-improvements branch from f7a9394 to c6a82cf Compare November 2, 2023 17:16
@giorio94
Copy link
Member Author

giorio94 commented Nov 2, 2023

Rebased onto main to fix conflict

@giorio94
Copy link
Member Author

giorio94 commented Nov 3, 2023

/test

All tests seem to have previously failed due to 1.1.1.1 being unreachable.

@giorio94
Copy link
Member Author

giorio94 commented Nov 6, 2023

/test

Retrying again

@giorio94 giorio94 force-pushed the mio/etcd-misc-improvements branch from c6a82cf to 60a0551 Compare November 6, 2023 16:46
@giorio94
Copy link
Member Author

giorio94 commented Nov 6, 2023

Rebased onto main to see if failures reproduce again (mainly conformance Gateway API, which seems totally unrelated).

@giorio94
Copy link
Member Author

giorio94 commented Nov 6, 2023

/test

Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I will go over this PR one more time to make sure I checked all edge cases.

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Nov 13, 2023
@giorio94 giorio94 added dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Nov 13, 2023
@giorio94 giorio94 force-pushed the mio/etcd-misc-improvements branch from 60a0551 to 871662e Compare November 13, 2023 17:42
@giorio94
Copy link
Member Author

/test

Both the client and the lease manager are always guaranteed to be
non-nil. Hence, let's drop the guarding if statements.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, upon connecting to an etcd cluster, we validate that all its
endpoints have version greater than or equal to 3.1.0. etcd v3.1.0 got
released in 2017, and the currently supported versions [1] are v3.4.x
(first released in 2019) and v3.5.x (first released in 2021). Given the
above and the fact that the check introduces additional load on etcd
during connection establishment (especially relevant in the context
of large scale cluster meshes), let's drop it.

[1]: https://etcd.io/docs/v3.5/op-guide/versioning/

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the etcd lease manager is implemented directly leveraging
the Grant() and Keepalive() primitives. Yet, certain functionalities
(i.e., etcd mutexes) depend on the composite Session type. Let's
rework the manager to additionally construct such session objects,
so that in a subsequent commit we can reuse the same logic to replace
the lock session handling. Most notably, making the acquisition of said
session lazy, so that the additional overhead (given the short TTL)
can be avoided when not necessary (e.g., in the clustermesh case, as
agents access remote etcd instances in read-only mode).

The overall acquisition logic is not modified, except for the necessary
adaptations. Unit tests are updated to account for the need to propagate
a full (mocked) client object, instead of the mocked sub-interface only.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's explicit the fact that there's a small possibility that the
returned lease/session is already expired, or gets expired immediately
before use, due to the time window between server-side lease expiration
and client-side detection. As we cannot completely remove this
uncertainty period, let's just adopt the simplest approach here, and
rely on retries to eventually converge.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This commit drops the currently used lock session acquisition logic and
replaces it with a new instance of the lease manager, adapting the call
sites. The lock session is still established regardless of whether the
client is used in normal or read-only mode, which will be improved while
subsequently refactoring the status checking logic.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
With the refactoring of the lease management logic, which is now
lazily performed on-demand instead of at startup time, it is no
longer necessary to explicitly wait for the initial session to be
established. Hence, let's remove the last usages and the function
itself. Additionally, this makes the behavior of the Update*()
methods consistent with that of the other ones, as well as makes
Close() no longer blocking.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The SetupDummy() function used to setup a kvstore client for testing
purposes (indirectly) spawns a goroutine which panics if the connection
to the kvstore fails. Yet, this is prone to race conditions, as in fast
tests we might close the client before that the connection check has
completed, causing the error to be raised.

To avoid this issue, let's explicitly wait for the connection to be
established during the setup. We never observed this race condition
in the past because the Close() operation used to block until the
connection was established (but that could hang in case of errors).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/etcd-misc-improvements branch from 871662e to f96bf45 Compare November 14, 2023 07:40
@giorio94
Copy link
Member Author

Rebased onto main to make CI happy

@giorio94
Copy link
Member Author

/test

Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@giorio94 giorio94 removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Nov 16, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 16, 2023
@julianwiedmann julianwiedmann merged commit 58624f2 into cilium:main Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kvstore Impacts the KVStore package interactions. kind/cleanup This includes no functional changes. kind/performance There is a performance impact of this. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants