-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Miscellaneous improvements to the etcd client #28834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Miscellaneous improvements to the etcd client #28834
Conversation
/cc @marseel |
c0d3a6c
to
a164f6d
Compare
/test |
1 similar comment
/test |
Last commit introduced an additional check to prevent a possible source of flakiness. |
6ea1682
to
f7a9394
Compare
Rebased onto main to fix conflict |
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but it's been a while since I last touched that part of the code base. I've requested a review from @marseel per #28834 (comment)
f7a9394
to
c6a82cf
Compare
Rebased onto main to fix conflict |
/test All tests seem to have previously failed due to 1.1.1.1 being unreachable. |
/test Retrying again |
c6a82cf
to
60a0551
Compare
Rebased onto main to see if failures reproduce again (mainly conformance Gateway API, which seems totally unrelated). |
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I will go over this PR one more time to make sure I checked all edge cases.
60a0551
to
871662e
Compare
/test |
Both the client and the lease manager are always guaranteed to be non-nil. Hence, let's drop the guarding if statements. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, upon connecting to an etcd cluster, we validate that all its endpoints have version greater than or equal to 3.1.0. etcd v3.1.0 got released in 2017, and the currently supported versions [1] are v3.4.x (first released in 2019) and v3.5.x (first released in 2021). Given the above and the fact that the check introduces additional load on etcd during connection establishment (especially relevant in the context of large scale cluster meshes), let's drop it. [1]: https://etcd.io/docs/v3.5/op-guide/versioning/ Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the etcd lease manager is implemented directly leveraging the Grant() and Keepalive() primitives. Yet, certain functionalities (i.e., etcd mutexes) depend on the composite Session type. Let's rework the manager to additionally construct such session objects, so that in a subsequent commit we can reuse the same logic to replace the lock session handling. Most notably, making the acquisition of said session lazy, so that the additional overhead (given the short TTL) can be avoided when not necessary (e.g., in the clustermesh case, as agents access remote etcd instances in read-only mode). The overall acquisition logic is not modified, except for the necessary adaptations. Unit tests are updated to account for the need to propagate a full (mocked) client object, instead of the mocked sub-interface only. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's explicit the fact that there's a small possibility that the returned lease/session is already expired, or gets expired immediately before use, due to the time window between server-side lease expiration and client-side detection. As we cannot completely remove this uncertainty period, let's just adopt the simplest approach here, and rely on retries to eventually converge. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This commit drops the currently used lock session acquisition logic and replaces it with a new instance of the lease manager, adapting the call sites. The lock session is still established regardless of whether the client is used in normal or read-only mode, which will be improved while subsequently refactoring the status checking logic. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
With the refactoring of the lease management logic, which is now lazily performed on-demand instead of at startup time, it is no longer necessary to explicitly wait for the initial session to be established. Hence, let's remove the last usages and the function itself. Additionally, this makes the behavior of the Update*() methods consistent with that of the other ones, as well as makes Close() no longer blocking. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The SetupDummy() function used to setup a kvstore client for testing purposes (indirectly) spawns a goroutine which panics if the connection to the kvstore fails. Yet, this is prone to race conditions, as in fast tests we might close the client before that the connection check has completed, causing the error to be raised. To avoid this issue, let's explicitly wait for the connection to be established during the setup. We never observed this race condition in the past because the Close() operation used to block until the connection was established (but that could hang in case of errors). Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
871662e
to
f96bf45
Compare
Rebased onto main to make CI happy |
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
Please review commit by commit, and refer to the corresponding descriptions for additional information.