Improve endpoint and DNS proxy lock contention during bursty DNS traffic #19347
Conversation
Nice find on the caller of the trigger grabbing the lock. Just one minor comment there to see if we could get more reliable initialization to avoid a potential race condition.
I was less sure about the second patch, probably better to have a more detailed discussion in the comments below. Though it also highlighted one potential (lack of) locking issue from outside subsystems into the endpoint which we should also follow up on.
pkg/proxy/logger/epinfo.go (outdated)

	OnDNSPolicyUpdateLocked(rules restore.DNSRules)
	OnDNSPolicyUpdateForce(rules restore.DNSRules)
🤔 Not directly related to this PR, but how does the user of EndpointUpdater lock the Endpoint when they call this function?

The naming scheme behind the fooLocked() was an attempt to highlight that the appropriate locks must be held while calling functions like this, and this comes in useful sometimes while debugging stack traces from locking-related bugs, especially if the calls cross package boundaries like this.
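For readers following along, here is a minimal sketch of what the Locked suffix convention implies; the Endpoint type and its fields below are hypothetical stand-ins, not the actual Cilium code:

```go
package example

import "sync"

// Endpoint is a stand-in type used only to illustrate the naming convention.
type Endpoint struct {
	mutex sync.RWMutex
	rules map[string]string
}

// OnDNSPolicyUpdateLocked assumes the caller already holds e.mutex for
// writing; it must not lock internally, or the caller would deadlock.
func (e *Endpoint) OnDNSPolicyUpdateLocked(rules map[string]string) {
	e.rules = rules
}

// OnDNSPolicyUpdate is the unsuffixed entry point: it takes the lock itself
// and then delegates to the Locked variant.
func (e *Endpoint) OnDNSPolicyUpdate(rules map[string]string) {
	e.mutex.Lock()
	defer e.mutex.Unlock()
	e.OnDNSPolicyUpdateLocked(rules)
}
```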
Updated as mentioned in similar comment.
I think we still need to review all usage of this function to ensure that they're properly locking the endpoint when they update the DNS rules. This doesn't need to be done for this PR, but it could lead to strange behaviour in rare cases.
pkg/endpoint/endpoint.go (outdated)

	// sync the endpoint state to the header file.
	e.UpdateDNSRulesIfNeeded()

	if err := e.rlockAlive(); err != nil {
I appreciate the detailed argument in the commit message, but I don't fully follow the logic for this second patch:

> This is especially suboptimal in the context of the DNS proxy. Upon a
> successful DNS response, the DNS proxy calls down to the endpoint to
> sync the potentially new IPs from the DNS request to its header file via
> writeHeaderfile() [1]. Therefore, while the DNS proxy is handling the
> request, the header file synchronization is actually causing Cilium to
> serialize processing the DNS request for a single endpoint. This is
> especially apparent with the previous commit.
>
> To illustrate the impact of the above a bit more concretely, if a single
> endpoint does 10 DNS requests at the same time, acquiring the write-lock
> causes the processing of those 10 requests to be done one at a time. For
> the sake of posterity, this is not the case if 10 endpoints were to make
> DNS requests in parallel.
After the first patch in this PR, that path no longer grabs the endpoint's write lock; it only triggers the update to disk, so multiple requests become a single call from the Trigger goroutine that handles a single write within a bounded period.
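To illustrate the coalescing behaviour being described, here is a generic, hedged sketch; it is not Cilium's pkg/trigger, just the idea that a burst of Trigger() calls within one interval collapses into a single callback invocation:

```go
package example

import (
	"sync"
	"time"
)

// Trigger coalesces bursts of Trigger() calls into one callback run per
// interval. Illustration only, not the actual pkg/trigger implementation.
type Trigger struct {
	mu       sync.Mutex
	pending  bool
	interval time.Duration
	fn       func()
}

func NewTrigger(interval time.Duration, fn func()) *Trigger {
	return &Trigger{interval: interval, fn: fn}
}

// Trigger marks work as pending. Any number of calls arriving within one
// interval result in exactly one invocation of fn from a timer goroutine.
func (t *Trigger) Trigger() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.pending {
		return // a run is already scheduled; coalesce this call into it
	}
	t.pending = true
	time.AfterFunc(t.interval, func() {
		t.mu.Lock()
		t.pending = false
		t.mu.Unlock()
		t.fn()
	})
}
```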
As far as I can tell, there are only two possible callpoints for writing the headerfile: either during Endpoint regeneration (runPreCompilationSteps()), at which point we generate the headerfile for potential use during compilation (or at least just to get the latest version of all of the endpoint's state to disk), or alternatively from the TriggerFunc via SyncEndpointHeaderFile(). Both of these callpoints should be fairly tightly ratelimited already by other mechanisms in the daemon (for endpoint regeneration, probably just the relatively low number of events or, in the worst case, the build mutex to limit concurrent regenerations; for FQDN updates, the Trigger).

I believe that the original reasoning behind the writelock for the headerfile was more about the write to disk and avoiding multiple writes there, rather than to do with the Endpoint data structure itself. But by now, the Trigger should resolve that question in a better way. I note also that the callee for writing the headerfile purports to be tolerant of concurrent calls, as it first creates a temporary file with the new content, then atomically swaps the file into the correct location, so it's not incompatible with this approach.
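As a sketch of the temp-file-plus-rename pattern referenced here (not the actual writeHeaderfile implementation; the function name and arguments are placeholders):

```go
package example

import (
	"os"
	"path/filepath"
)

// atomicWrite illustrates why concurrent callers are tolerable: each caller
// produces a complete file under a unique temporary name, and the final
// rename is atomic, so readers never observe a partially written file.
func atomicWrite(dir, name string, content []byte) error {
	tmp, err := os.CreateTemp(dir, name+".tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleans up on failure; harmless after rename

	if _, err := tmp.Write(content); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename within the same filesystem atomically replaces the target;
	// with concurrent writers, the last rename wins.
	return os.Rename(tmp.Name(), filepath.Join(dir, name))
}
```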
The patch as-is still serializes the callers of writeHeaderfile(), since they still must grab the writelock to be able to synchronize the DNS Rules field. I agree that if writeHeaderfile() itself is expensive, then by reducing the size of the critical section, we could reduce lock contention during the expensive operation; though if the write to disk really is expensive, then we're now potentially exacerbating the issue (of spending more time writing to disk) by allowing yet more concurrent writes to disk.
Furthermore, e.buildMutex is already held for writing by both runPreCompilationSteps() [1] and syncEndpointHeaderfile(), so even if the critical sections for the read/write locks of e.mutex are changed here, at least the headerfile write execution paths will not be impacted when it comes to parallel execution. Perhaps if there are other read-only accesses going on in the background from other parts of the agent and attempting to read Endpoint fields, then the reduced time spent holding the e.mutex for writing could allow those other reads to complete quicker. Is that what you are concerned about? Do you have examples of those read operations to highlight the use case for this kind of improvement? It would also be helpful to understand real-world examples of how long we expect these writes to take, particularly with worst-case scenarios like many FQDNs in the DNS history, as that could further motivate a change like this.
Footnotes

[1] Via call stack Endpoint.regenerate() -> Endpoint.regenerateBPF() -> Endpoint.runPreCompilationSteps().
You're correct in that this patch doesn't fully make the header file sync read-only, as I missed that runPreCompilationSteps() also writes to the header file.
> As far as I can tell, there are only two possible callpoints for writing the headerfile: either during Endpoint regeneration (runPreCompilationSteps()), at which point we generate the headerfile for potential use during compilation (or at least just to get the latest version of all of the endpoint's state to disk), or alternatively from the TriggerFunc via SyncEndpointHeaderFile(). Both of these callpoints should be fairly tightly ratelimited already by other mechanisms in the daemon (for endpoint regeneration, probably just the relatively low number of events or, in the worst case, the build mutex to limit concurrent regenerations; for FQDN updates, the Trigger).
>
> I believe that the original reasoning behind the writelock for the headerfile was more about the write to disk and avoiding multiple writes there, rather than to do with the Endpoint data structure itself. But by now, the Trigger should resolve that question in a better way. I note also that the callee for writing the headerfile purports to be tolerant of concurrent calls, as it first creates a temporary file with the new content, then atomically swaps the file into the correct location, so it's not incompatible with this approach.
👍
> The patch as-is still serializes the callers of writeHeaderfile(), since they still must grab the writelock to be able to synchronize the DNS Rules field. I agree that if writeHeaderfile() itself is expensive, then by reducing the size of the critical section, we could reduce lock contention during the expensive operation; though if the write to disk really is expensive, then we're now potentially exacerbating the issue (of spending more time writing to disk) by allowing yet more concurrent writes to disk.
Technically, with this patch, yes, updating the DNSRules field causes us to grab the write-lock, which probably makes the execution much more serial than I was expecting and described in this commit msg. I do think it's a valuable change to reduce the critical section. It's worth mentioning that yes, we are writing to disk, but this write is practically not really to disk -- it occurs in the Cilium state directory, which is typically mounted as a tmpfs, so it's pretty much writing to memory -- just a longer-winded way to do it.
> Furthermore, e.buildMutex is already held for writing by both runPreCompilationSteps() [1] and syncEndpointHeaderfile(), so even if the critical sections for the read/write locks of e.mutex are changed here, at least the headerfile write execution paths will not be impacted when it comes to parallel execution. Perhaps if there are other read-only accesses going on in the background from other parts of the agent and attempting to read Endpoint fields, then the reduced time spent holding the e.mutex for writing could allow those other reads to complete quicker. Is that what you are concerned about? Do you have examples of those read operations to highlight the use case for this kind of improvement? It would also be helpful to understand real-world examples of how long we expect these writes to take, particularly with worst-case scenarios like many FQDNs in the DNS history, as that could further motivate a change like this.
In the end, yes, I'd say this commit is effectively concerned about the above, or at least tries to improve it. I didn't do any measuring or benchmarks to back up my proposal, as it would've required reproducing the environment where there was extremely bursty DNS traffic from one endpoint. As mentioned in the commit msg, I viewed the Go blocking profile (from pprof-trace) and concluded that there was heavy lock contention around the endpoint mutex, so I presume that reducing the critical section with these changes will improve it. If we believe this is too risky of a change and needs concrete data to back it up, I'd be happy to take the time to think about how to benchmark it. It can likely be driven through a unit test.

I'm going to update the commit msg text to reflect the above. Thanks for drilling into it and not taking my words at face value :).
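For reference, one hedged sketch of how such a unit-test-driven measurement could look, using Go's standard block profile; the mutex and the sleep are stand-ins for the endpoint lock and the header file write, not Cilium code:

```go
package example

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"testing"
	"time"
)

// BenchmarkEndpointLockContention is a hypothetical benchmark (placed in a
// _test.go file): parallel goroutines hammer a shared mutex the way bursty
// DNS requests hammer the endpoint lock, and the block profile records
// where they queue up.
func BenchmarkEndpointLockContention(b *testing.B) {
	runtime.SetBlockProfileRate(1) // record every blocking event
	defer runtime.SetBlockProfileRate(0)

	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			time.Sleep(50 * time.Microsecond) // stand-in for the header file write
			mu.Unlock()
		}
	})

	// Dump the block profile so the contention around mu is visible.
	_ = pprof.Lookup("block").WriteTo(os.Stderr, 1)
}
```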
> It's worth mentioning that yes, we are writing to disk, but this write is practically not really to disk -- it occurs in the Cilium state directory, which is typically mounted as a tmpfs, so it's pretty much writing to memory -- just a longer-winded way to do it.
💡 This never really occurred to me before, but given that /run (and hence also /var/run) are typically tmpfs, yes, this would be the common case.
I'm ~OK with this, but this smells a bit like premature optimization to me. If we're typically running this in-memory write once every 5 seconds, and it's maybe a few kilobytes then I'd expect the locking delay from this particular operation to be maybe in the low microseconds, and only occasionally interrupting other operations. Sure we could make it faster, but the code becomes more nuanced to deal with, and I'm not sure whether the evidence is pointing towards this in particular being an issue.
> As mentioned in the commit msg, I viewed the Go blocking profile (from pprof-trace) and concluded that there was heavy lock contention around the endpoint mutex
I suspect that the first patch will do a lot more for this than the second. But 🤷
> I suspect that the first patch will do a lot more for this than the second.
Agreed. I can see the argument for premature optimization. My thinking was to try to reduce as much lock contention as I could.
From the user's sysdump and pprof, out of ~440 goroutines, 203 were blocked inside SyncEndpointHeaderFile(). While the first patch will likely resolve all of that, just considering that that many goroutines could be going through this function was enough impetus for me to try to go as far as possible.

Do you feel there's a risk that this optimization could lead to regressions, or is it more that the restructuring of the code wasn't necessarily worth it given the lack of evidence?
I think that the risk is that we make the locking access more complicated, putting more burden on the next person to come along to understand the implications of this change and to try to retain the same optimizations. But then if we don't provide a strong enough argument or context for the next person to reason about this optimization, then it's difficult for them to figure out whether any future improvements could cause a regression in performance here.
The implications there can range from simply "is it important to ensure that the writelock is not held while writing the headerfile?", which I'm currently assuming is rare for a relatively short period (though that could plausibly change in future, either intentionally or unintentionally), through to "is it correct for these two operations to now be in separate critical sections?", since we'll now release the writelock, then grab the readlock, without a direct transition. Other goroutines could technically grab a lock in between the two critical sections.

This should be OK right now, as the writelock part is just trying to do a best-effort grab of the latest DNS rules state. But if it turns out that these two critical sections get split up for a long period because of another goroutine grabbing and holding the lock in between, then you could imagine that the readlock portion where we write the headerfile could end up writing somewhat old state. In general this should not occur, and all lock accesses should be relatively short, but in the case of a bug, maybe we rarely hit a strange issue where stale DNS rules get written and that eventually leads to a restart / restore issue with old DNS information.
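To make the concern concrete, here is a hypothetical sketch of the two separate critical sections and the gap between them; the names are illustrative and this is not the actual patch:

```go
package example

import "sync"

// Endpoint is a stand-in used only to show the split critical sections.
type Endpoint struct {
	mutex    sync.RWMutex
	dnsRules map[string]string
}

func (e *Endpoint) syncHeaderfile(latest map[string]string, write func(map[string]string)) {
	// Critical section 1: update the DNS rules under the write lock.
	e.mutex.Lock()
	e.dnsRules = latest
	e.mutex.Unlock()

	// Gap: another goroutine may acquire e.mutex here, so the two sections
	// no longer observe a single consistent endpoint state.

	// Critical section 2: snapshot under the read lock only, letting other
	// readers proceed, then do the (potentially slow) write outside the lock.
	e.mutex.RLock()
	snapshot := e.dnsRules // reference copy, sufficient for illustration
	e.mutex.RUnlock()
	write(snapshot)
}
```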
OK that makes sense to me. I'll drop this commit, but keep it locally in case we may want to apply it again in the future. I'll transfer the commit msg over to the first patch.
FWIW,
> in the case of a bug, maybe we rarely hit a strange issue where stale DNS rules get written and that eventually leads to a restart / restore issue with old DNS information.
since we are writing DNS information on a trigger every 5 seconds, technically the DNS information has gone stale as soon as we write it and begin waiting. Yes it's possible for us to write stale information just due to the scheduler and the gap between read and write locks, but the impact is minimal (/ acceptable IMO) since it'll just get updated again in the near future.
One more minor locking issue to resolve.
@@ -493,6 +495,19 @@ func createEndpoint(owner regeneration.Owner, policyGetter policyRepoGetter, nam
	return ep
}

func (e *Endpoint) initDNSHistoryTrigger() {
	// Note: This can only fail if the trigger func is nil.
Incidentally, this is a pretty weird API; this feels like something we could catch at compile time rather than encoding runtime errors to catch it. Something to think about for future improvements.
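Purely as an illustration of the compile-time direction hinted at here (a hypothetical API, not something from this PR): making the callback a required constructor argument removes the "trigger func is nil" failure from the common construction path, rather than reporting it at runtime.

```go
package example

import "time"

// TriggerFunc and DNSHistoryTrigger are hypothetical; they only sketch the
// shape of an API where the callback cannot simply be left unset.
type TriggerFunc func(reasons []string)

type DNSHistoryTrigger struct {
	minInterval time.Duration
	fn          TriggerFunc
}

// NewDNSHistoryTrigger takes the callback as a required argument, so the
// common misuse (constructing the trigger without ever setting the callback)
// does not arise; an explicit nil is still possible but easy to reject here.
func NewDNSHistoryTrigger(minInterval time.Duration, fn TriggerFunc) DNSHistoryTrigger {
	return DNSHistoryTrigger{minInterval: minInterval, fn: fn}
}
```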
/test
/test (Edit: flakes hit)

/test-1.23-net-next (Edit: hit #19264)
Upon a successful DNS response, Cilium's DNS proxy code will sync the
DNS history state to the individual endpoint's header file. Previously,
this sync was done inside a trigger; however, the calling code,
(*Endpoint).SyncEndpointHeaderFile(), acquired a write-lock for no good
reason. This effectively negated the benefits of having the DNS history
sync behind a trigger of 5 seconds.
This is especially suboptimal because the header file sync is actually
causing Cilium to serialize processing the DNS request for a single
endpoint.
To illustrate the impact of the above a bit more concretely, if a single
endpoint does 10 DNS requests at the same time, acquiring the write-lock
causes the processing of those 10 requests to be done one at a time. For
the sake of posterity, this is not the case if 10 endpoints were to make
DNS requests in parallel.
This obviously has a performance impact, both in terms of being slow
CPU-wise and in terms of memory. Take, for example, a bursty DNS request
environment: it could cause an uptick in memory usage due to many
goroutines being created and blocked because of the serialized locking.
Now that the code is all executing behind a trigger, we can remove the
lock completely and initialize the trigger setup where the Endpoint
object is created (e.g. createEndpoint(), parseEndpoint()). Now the lock
is only taken every 5 seconds, when the trigger runs.
This should relieve the lock contention drastically. For context, in a
user's environment where the pprof was shared with us, there were around
440 goroutines with 203 of them stuck waiting inside
SyncEndpointHeaderFile().
We can also modify SyncEndpointHeaderFile() to no longer return an
error, because it's not possible for invoking the trigger to fail. If we
fail to initialize the trigger itself, then we log an error, but this is
essentially impossible because it can only fail if the trigger func is
nil (which we control).
Understanding the locking contention came from inspecting the pprof via
the following command and subsequent code inspection.
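```
go tool trace -http :8080 ./cilium ./pprof-trace
```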
Suggested-by: Michi Mutsuzaki <michi@isovalent.com>
Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>