Silence spurious clustermesh-related warnings #35867
Conversation
Don't emit a spurious warning in case remote cluster configuration retrieval gets aborted due to the parent context being canceled (e.g., due to reconnecting to etcd), as the cancellation is already logged elsewhere and the extra warning is potentially misleading to users. Reported-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
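A minimal sketch of the idea behind this commit, assuming a logrus-style logger; the function names are illustrative, not the actual Cilium code:

```go
package example

import (
	"context"
	"errors"

	"github.com/sirupsen/logrus"
)

// fetchRemoteConfig stands in for the real retrieval of the remote
// cluster configuration from the kvstore (hypothetical helper).
func fetchRemoteConfig(ctx context.Context) error {
	return ctx.Err()
}

func retrieveConfig(ctx context.Context, log *logrus.Entry) {
	if err := fetchRemoteConfig(ctx); err != nil {
		// If the parent context was canceled (e.g., the etcd connection
		// is being re-established), the event is already logged by the
		// code tearing down the connection: stay silent here.
		if errors.Is(err, context.Canceled) {
			return
		}
		log.WithError(err).Warning("Unable to retrieve remote cluster configuration")
	}
}
```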
Currently, a warning log is always emitted when the list operation to start a watcher fails, even though it will automatically be retried. Given that we are starting to flag warning logs in CI, and this can briefly occur in the clustermesh context during the initialization phase (especially when the authorization mode is configured to cluster, until permissions are granted), let's lower the severity for the first few occurrences and emit a warning only if the situation does not improve. Reported-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
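As a rough illustration of the graded severity described above (the threshold and names are assumptions, not the values used by Cilium):

```go
package example

import "github.com/sirupsen/logrus"

// warnAfterAttempts is an assumed threshold; the real implementation may
// use a different count or a time-based condition.
const warnAfterAttempts = 3

// logListFailure demotes the first few list failures to info severity,
// since they are expected while permissions are still being granted
// (authorization mode "cluster"), and escalates to a warning only if
// the failures persist across retries.
func logListFailure(log *logrus.Entry, attempt int, err error) {
	entry := log.WithError(err).WithField("attempt", attempt)
	if attempt < warnAfterAttempts {
		entry.Info("List operation failed, retrying")
		return
	}
	entry.Warning("List operation keeps failing")
}
```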
Explicitly configure the clustermesh-sync-timeout parameter to a higher value to prevent emitting spurious warning logs during the initial connection phase when the authorization mode is configured to cluster. Indeed, Cilium agents get restarted at that point (due to the configuration of host aliases), while the clustermesh-apiserver does not (assuming that KVStoreMesh is disabled), potentially requiring up to one minute to react to the configmap change and create the associated user in etcd. Still, the warning in this case would have been benign, because no cross-cluster connections could be present, as we are meshing the two clusters for the first time right now. Reported-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
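The timeout's role can be pictured with a small watchdog sketch (names and structure are illustrative, not the actual Cilium implementation): if remote clusters have not synchronized within clustermesh-sync-timeout, a warning fires, so a first-time mesh whose apiserver may take up to a minute to pick up the configmap change can trip it unless the timeout is raised.

```go
package example

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)

// waitForSync warns if synchronization with remote clusters does not
// complete within the configured timeout. Raising the timeout above the
// apiserver's reaction time avoids the benign warning observed while
// meshing two clusters for the first time.
func waitForSync(ctx context.Context, synced <-chan struct{}, timeout time.Duration, log *logrus.Entry) {
	select {
	case <-synced:
		// All remote clusters synchronized in time; nothing to log.
	case <-time.After(timeout):
		log.Warning("Clustermesh synchronization timed out")
	case <-ctx.Done():
		// Shutting down; the cancellation is reported elsewhere.
	}
}
```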
/test
@@ -158,7 +158,12 @@ func (rc *remoteCluster) restartRemoteConnection() {
		if backend != nil {
			backend.Close()
		}
		rc.logger.WithError(err).Warning("Unable to establish etcd connection to remote cluster")
@giorio94 I see you did not backport this change to v1.16. Was that intentional? I'm seeing it on downgrades to v1.16: https://github.com/cilium/cilium/actions/runs/11935307422/job/33266363458. Should I just allowlist it until v1.17 is out?
Thanks for noticing. It seems I mistakenly linked a different backport to this one and marked this one as completed rather than the other. Marking again for backport, so that it gets included in the next round.
Aah, that's why the commits looked so different 😆
Please refer to the individual commits for additional details.
Fixes: #35865
Marked for backport to v1.16 to prevent issues on downgrade when warning logs start being flagged in CI.