Member

@brb brb commented Mar 9, 2021

It is not the end of the world if any arping-related operation fails (e.g. frequent connections between nodes ensure the presence of relevant L2 entries in the neigh table). So, decrease the log level of the log msgs.

@brb brb added area/daemon Impacts operation of the Cilium daemon. release-note/misc This PR makes changes that have no direct user impact. needs-backport/1.8 labels Mar 9, 2021
@brb brb requested review from a team and jrfastab March 9, 2021 09:39
@brb brb force-pushed the pr/brb/reduce-arping-log-level branch from 82ea199 to 6aaa80e Compare March 9, 2021 09:39
Member

@aanm aanm left a comment

I'm fine with reducing the log msgs to level "Info", but an "Info" message with the word "Fail" in it might be confusing to users. If they are not that important, maybe consider setting them to debug or adding a new metric?

Member Author

brb commented Mar 9, 2021

If they are not that important, maybe consider setting them to debug or adding a new metric?

@aanm I still want to see what exactly failed, so I'm keeping them as info-level log msgs instead of adding a metric.

Member

aanm commented Mar 9, 2021

If they are not that important, maybe consider setting them to debug or adding a new metric?

@aanm I still want to see what exactly failed, so I'm keeping them as info-level log msgs instead of adding a metric.

@brb you, a developer, yes, but users will see this and find it confusing.

Member Author

brb commented Mar 9, 2021

@aanm Recently we made some big changes to the ARP handling code, so I'd like to observe any possible discrepancies, at least for a while. Later on, we can mute the messages and expose errors via metrics.
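
To illustrate the "expose errors via metrics" option mentioned in this comment, here is a minimal sketch of a per-operation failure counter. It uses only the standard library; `failureCounter`, its methods, and the operation names are hypothetical stand-ins and not Cilium's metrics API (a real implementation would more likely use a Prometheus counter).

```go
package main

import (
	"fmt"
	"sync"
)

// failureCounter is a hypothetical stand-in for a real metrics counter
// keyed by operation name.
type failureCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newFailureCounter() *failureCounter {
	return &failureCounter{counts: make(map[string]uint64)}
}

// Inc records one failure of the named operation.
func (c *failureCounter) Inc(op string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[op]++
}

// Get returns how many failures were recorded for the named operation.
func (c *failureCounter) Get(op string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[op]
}

func main() {
	failures := newFailureCounter()
	// Operation names are made up for this example.
	failures.Inc("arping")
	failures.Inc("neigh-remove")
	failures.Inc("arping")
	fmt.Println(failures.Get("arping"), failures.Get("neigh-remove"))
}
```

A counter shows how often something failed but not what exactly failed, which is the trade-off raised in this thread.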

Member

aanm commented Mar 9, 2021

@aanm Recently we made some big changes to the ARP handling code, so I'd like to observe any possible discrepancies, at least for a while. Later on, we can mute the messages and expose errors via metrics.

Then let's prefix these messages with [DEBUG]

@brb brb force-pushed the pr/brb/reduce-arping-log-level branch from 6aaa80e to 5240b52 Compare March 15, 2021 08:20
Member Author

brb commented Mar 15, 2021

Then let's prefix these messages with [DEBUG]

@aanm I've changed the logging subsystem to node-neigh-debug. PTAL.

Contributor

@jrfastab jrfastab left a comment

LGTM. You could change "Failed" -> "Unable" so that those messages read as less severe, e.g. "Failed to remove neighbor entry" becomes "Unable to remove neighbor entry".

@@ -695,7 +695,7 @@ func (n *linuxNodeHandler) insertNeighbor(ctx context.Context, newNode *nodeType
 				logfields.IPAddr:       neigh.IP,
 				logfields.HardwareAddr: neigh.HardwareAddr,
 				logfields.LinkIndex:    neigh.LinkIndex,
-			}).WithError(err).Warn("Failed to remove neighbor entry")
+			}).WithError(err).Info("Failed to remove neighbor entry")
Contributor

Why would this happen and not be an error? Is it possible for the entry to be removed asynchronously to this operation, so that it no longer exists, or something similar?

Member Author

I was thinking about the case when a network operator removes an entry manually.

Contributor

Beyond the scope of this patch, but I took a look at the callers of insertNeighbor: shouldn't we return an error here and kick the retry logic? Otherwise we are waiting for the refresh logic to kick in from neighbor-table-refresh. I would expect that a failed arp could be retried almost immediately and then backed off from there if it keeps failing.

@aanm aanm added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 29, 2021
@aanm aanm removed their assignment Mar 29, 2021
@brb brb force-pushed the pr/brb/reduce-arping-log-level branch from 5240b52 to 204fe6c Compare April 9, 2021 07:04
It is not the end of the world if any arping related operation fails
(e.g. frequent connections between nodes ensure the presence of
relevant L2 entries in the neigh table). So, decrease the log level of
the log msgs and change the log subsystem to "node-neigh-debug".

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Member Author

brb commented Apr 9, 2021

Beyond the scope of this patch, but I took a look at the callers of insertNeighbor: shouldn't we return an error here and kick the retry logic?

@jrfastab I've added retry logic to both arping libraries we use. They will retry 3 times. If all attempts fail, then it is up to the periodic refresh to fix any outstanding issues.

@brb brb force-pushed the pr/brb/reduce-arping-log-level branch from 204fe6c to eca1bed Compare April 9, 2021 07:12
@brb brb added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 9, 2021