Improve endpoint and DNS proxy lock contention during bursty DNS traffic #19347
Conversation
Nice find on the caller of the trigger grabbing the lock. Just one minor comment there to see if we could get more reliable initialization to avoid a potential race condition.
I was less sure about the second patch, probably better to have a more detailed discussion in the comments below. Though it also highlighted one potential (lack of) locking issue from outside subsystems into the endpoint which we should also follow up on.
pkg/proxy/logger/epinfo.go (outdated)

	OnDNSPolicyUpdateLocked(rules restore.DNSRules)
	OnDNSPolicyUpdateForce(rules restore.DNSRules)
🤔 Not directly related to this PR, but how does the user of EndpointUpdater lock the Endpoint when they call this function?

The naming scheme behind the fooLocked() was an attempt to highlight that the appropriate locks must be held while calling functions like this, and this comes in useful sometimes while debugging stack traces from locking-related bugs, especially if the calls cross package boundaries like this.
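For readers following along, here is a minimal sketch of what the Locked suffix convention implies; the Endpoint type and its fields below are hypothetical stand-ins, not the actual Cilium code:

```go
package example

import "sync"

// Endpoint is a stand-in type used only to illustrate the naming convention.
type Endpoint struct {
	mutex sync.RWMutex
	rules map[string]string
}

// OnDNSPolicyUpdateLocked assumes the caller already holds e.mutex for
// writing; it must not lock internally, or the caller would deadlock.
func (e *Endpoint) OnDNSPolicyUpdateLocked(rules map[string]string) {
	e.rules = rules
}

// OnDNSPolicyUpdate is the unsuffixed entry point: it takes the lock itself
// and then delegates to the Locked variant.
func (e *Endpoint) OnDNSPolicyUpdate(rules map[string]string) {
	e.mutex.Lock()
	defer e.mutex.Unlock()
	e.OnDNSPolicyUpdateLocked(rules)
}
```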
Updated as mentioned in similar comment.
I think we still need to review all usage of this function to ensure that they're properly locking the endpoint when they update the DNS rules. This doesn't need to be done for this PR, but it could lead to strange behaviour in rare cases.
pkg/endpoint/endpoint.go (outdated)

	// sync the endpoint state to the header file.
	e.UpdateDNSRulesIfNeeded()

	if err := e.rlockAlive(); err != nil {
I appreciate the detailed argument in the commit message, but I don't fully follow the logic for this second patch:

> This is especially suboptimal in the context of the DNS proxy. Upon a
> successful DNS response, the DNS proxy calls down to the endpoint to
> sync the potentially new IPs from the DNS request to its header file via
> writeHeaderfile() [1]. Therefore, while the DNS proxy is handling the
> request, the header file synchronization is actually causing Cilium to
> serialize processing the DNS request for a single endpoint. This is
> especially apparent with the previous commit.
>
> To illustrate the impact of the above a bit more concretely, if a single
> endpoint does 10 DNS requests at the same time, acquiring the write-lock
> causes the processing of those 10 requests to be done one at a time. For
> the sake of posterity, this is not the case if 10 endpoints were to make
> DNS requests in parallel.
After the first patch in this PR, that path no longer grabs the endpoint's write lock; it only triggers the update to disk, so multiple requests become a single call from the Trigger goroutine that handles a single write within a bounded period.
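To illustrate the coalescing behaviour being described, here is a generic, hedged sketch; it is not Cilium's pkg/trigger, just the idea that a burst of Trigger() calls within one interval collapses into a single callback invocation:

```go
package example

import (
	"sync"
	"time"
)

// Trigger coalesces bursts of Trigger() calls into one callback run per
// interval. Illustration only, not the actual pkg/trigger implementation.
type Trigger struct {
	mu       sync.Mutex
	pending  bool
	interval time.Duration
	fn       func()
}

func NewTrigger(interval time.Duration, fn func()) *Trigger {
	return &Trigger{interval: interval, fn: fn}
}

// Trigger marks work as pending. Any number of calls arriving within one
// interval result in exactly one invocation of fn from a timer goroutine.
func (t *Trigger) Trigger() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.pending {
		return // a run is already scheduled; coalesce this call into it
	}
	t.pending = true
	time.AfterFunc(t.interval, func() {
		t.mu.Lock()
		t.pending = false
		t.mu.Unlock()
		t.fn()
	})
}
```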
As far as I can tell, there are only two possible callpoints for writing the headerfile: either during Endpoint regeneration (runPreCompilationSteps()), at which point we generate the headerfile for potential use during compilation (or at least just to get the latest version of all of the endpoint's state to disk), or alternatively from the TriggerFunc via SyncEndpointHeaderFile(). Both of these callpoints should be fairly tightly ratelimited already by other mechanisms in the daemon (for endpoint regeneration, probably just the relatively low number of events or, in the worst case, the build mutex to limit concurrent regenerations; for FQDN updates, the Trigger).

I believe that the original reasoning behind the writelock for the headerfile was more about the write to disk and avoiding multiple writes there, rather than to do with the Endpoint data structure itself. But by now, the Trigger should resolve that question in a better way. I note also that the callee for writing the headerfile purports to be tolerant of concurrent calls, as it first creates a temporary file with the new content, then atomically swaps the file into the correct location, so it's not incompatible with this approach.
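As a sketch of the temp-file-plus-rename pattern referenced here (not the actual writeHeaderfile implementation; the function name and arguments are placeholders):

```go
package example

import (
	"os"
	"path/filepath"
)

// atomicWrite illustrates why concurrent callers are tolerable: each caller
// produces a complete file under a unique temporary name, and the final
// rename is atomic, so readers never observe a partially written file.
func atomicWrite(dir, name string, content []byte) error {
	tmp, err := os.CreateTemp(dir, name+".tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleans up on failure; harmless after rename

	if _, err := tmp.Write(content); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename within the same filesystem atomically replaces the target;
	// with concurrent writers, the last rename wins.
	return os.Rename(tmp.Name(), filepath.Join(dir, name))
}
```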
The patch as-is still serializes the callers of writeHeaderfile(), since they still must grab the writelock to be able to synchronize the DNS Rules field. I agree that if writeHeaderfile() itself is expensive, then by reducing the size of the critical section, we could reduce lock contention during the expensive operation; though if the write to disk really is expensive, then we're now potentially exacerbating the issue (of spending more time writing to disk) by allowing yet more concurrent writes to disk.
Furthermore, e.buildMutex is already held for writing by both runPreCompilationSteps() [1] and syncEndpointHeaderfile(), so even if the critical sections for the read/write locks of e.mutex are changed here, at least the headerfile write execution paths will not be impacted when it comes to parallel execution. Perhaps if there are other read-only accesses going on in the background from other parts of the agent and attempting to read Endpoint fields, then the reduced time spent holding the e.mutex for writing could allow those other reads to complete quicker. Is that what you are concerned about? Do you have examples of those read operations to highlight the use case for this kind of improvement? It would also be helpful to understand real-world examples of how long we expect these writes to take, particularly with worst-case scenarios like many FQDNs in the DNS history, as that could further motivate a change like this.
Footnotes

[1] Via call stack Endpoint.regenerate() -> Endpoint.regenerateBPF() -> Endpoint.runPreCompilationSteps().
You're correct in that this patch doesn't fully make the header file sync read-only, as I missed that runPreCompilationSteps() also writes to the header file.
> As far as I can tell, there are only two possible callpoints for writing the headerfile: either during Endpoint regeneration (runPreCompilationSteps()), at which point we generate the headerfile for potential use during compilation (or at least just to get the latest version of all of the endpoint's state to disk), or alternatively from the TriggerFunc via SyncEndpointHeaderFile(). Both of these callpoints should be fairly tightly ratelimited already by other mechanisms in the daemon (for endpoint regeneration, probably just the relatively low number of events or, in the worst case, the build mutex to limit concurrent regenerations; for FQDN updates, the Trigger).
>
> I believe that the original reasoning behind the writelock for the headerfile was more about the write to disk and avoiding multiple writes there, rather than to do with the Endpoint data structure itself. But by now, the Trigger should resolve that question in a better way. I note also that the callee for writing the headerfile purports to be tolerant of concurrent calls, as it first creates a temporary file with the new content, then atomically swaps the file into the correct location, so it's not incompatible with this approach.
👍
> The patch as-is still serializes the callers of writeHeaderfile(), since they still must grab the writelock to be able to synchronize the DNS Rules field. I agree that if writeHeaderfile() itself is expensive, then by reducing the size of the critical section, we could reduce lock contention during the expensive operation; though if the write to disk really is expensive, then we're now potentially exacerbating the issue (of spending more time writing to disk) by allowing yet more concurrent writes to disk.
Technically, with this patch, yes, updating the DNSRules field causes us to grab the write-lock, which probably makes the execution much more serial than I was expecting and described in this commit msg. I do think it's a valuable change to reduce the critical section. It's worth mentioning that yes, we are writing to disk, but this write is practically not really to disk -- it occurs in the Cilium state directory, which is typically mounted as a tmpfs, so it's pretty much writing to memory -- just a longer-winded way to do it.
> Furthermore, e.buildMutex is already held for writing by both runPreCompilationSteps() [1] and syncEndpointHeaderfile(), so even if the critical sections for the read/write locks of e.mutex are changed here, at least the headerfile write execution paths will not be impacted when it comes to parallel execution. Perhaps if there are other read-only accesses going on in the background from other parts of the agent and attempting to read Endpoint fields, then the reduced time spent holding the e.mutex for writing could allow those other reads to complete quicker. Is that what you are concerned about? Do you have examples of those read operations to highlight the use case for this kind of improvement? It would also be helpful to understand real-world examples of how long we expect these writes to take, particularly with worst-case scenarios like many FQDNs in the DNS history, as that could further motivate a change like this.
In the end, yes, I'd say this commit is effectively concerned about the above, or at least tries to improve it. I didn't do any measuring or benchmarks to back up my proposal, as it would've required reproducing the environment where there was extremely bursty DNS traffic from one endpoint. As mentioned in the commit msg, I viewed the Go blocking profile (from pprof-trace) and concluded that there was heavy lock contention around the endpoint mutex, so I presume that reducing the critical section with these changes will improve it. If we believe this is too risky of a change and needs concrete data to back it up, I'd be happy to take the time to think about how to benchmark it. It can likely be driven through a unit test.

I'm going to update the commit msg text to reflect the above. Thanks for drilling into it and not taking my words at face value :).
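For reference, one hedged sketch of how such a unit-test-driven measurement could look, using Go's standard block profile; the mutex and the sleep are stand-ins for the endpoint lock and the header file write, not Cilium code:

```go
package example

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"testing"
	"time"
)

// BenchmarkEndpointLockContention is a hypothetical benchmark (placed in a
// _test.go file): parallel goroutines hammer a shared mutex the way bursty
// DNS requests hammer the endpoint lock, and the block profile records
// where they queue up.
func BenchmarkEndpointLockContention(b *testing.B) {
	runtime.SetBlockProfileRate(1) // record every blocking event
	defer runtime.SetBlockProfileRate(0)

	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			time.Sleep(50 * time.Microsecond) // stand-in for the header file write
			mu.Unlock()
		}
	})

	// Dump the block profile so the contention around mu is visible.
	_ = pprof.Lookup("block").WriteTo(os.Stderr, 1)
}
```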
> It's worth mentioning that yes, we are writing to disk, but this write is practically not really to disk -- it occurs in the Cilium state directory, which is typically mounted as a tmpfs, so it's pretty much writing to memory -- just a longer-winded way to do it.
💡 This never really occurred to me before, but given that /run (and hence also /var/run) are typically tmpfs, yes, this would be the common case.
I'm ~OK with this, but this smells a bit like premature optimization to me. If we're typically running this in-memory write once every 5 seconds, and it's maybe a few kilobytes then I'd expect the locking delay from this particular operation to be maybe in the low microseconds, and only occasionally interrupting other operations. Sure we could make it faster, but the code becomes more nuanced to deal with, and I'm not sure whether the evidence is pointing towards this in particular being an issue.
> As mentioned in the commit msg, I viewed the Go blocking profile (from pprof-trace) and concluded that there was heavy lock contention around the endpoint mutex
I suspect that the first patch will do a lot more for this than the second. But 🤷
> I suspect that the first patch will do a lot more for this than the second.
Agreed. I can see the argument for premature optimization. My thinking was to try to reduce as much lock contention as I could.
From the user's sysdump and pprof, out of ~440 goroutines, 203 were blocked inside SyncEndpointHeaderFile(). While the first patch will likely resolve all of that, just considering that that many goroutines could be going through this function was enough impetus for me to try to go as far as possible.

Do you feel there's a risk that this optimization could lead to regressions, or is it more that the restructuring of the code wasn't necessarily worth it given the lack of evidence?
I think that the risk is that we make the locking access more complicated, putting more burden on the next person to come along to understand the implications of this change and to try to retain the same optimizations. But then if we don't provide a strong enough argument or context for the next person to reason about this optimization, then it's difficult for them to figure out whether any future improvements could cause a regression in performance here.
The implications there can range from simply "is it important to ensure that the writelock is not held while writing the headerfile?", which I'm currently assuming is rare for a relatively short period (though that could plausibly change in future, either intentionally or unintentionally), through to "is it correct for these two operations to now be in separate critical sections?", since we'll now release the writelock, then grab the readlock, without a direct transition. Other goroutines could technically grab a lock in between the two critical sections.

This should be OK right now, as the writelock part is just trying to do a best-effort grab of the latest DNS rules state. But if it turns out that these two critical sections get split up for a long period because of another goroutine grabbing and holding the lock in between, then you could imagine that the readlock portion where we write the headerfile could end up writing somewhat old state. In general this should not occur, and all lock accesses should be relatively short, but in the case of a bug, maybe we rarely hit a strange issue where stale DNS rules get written and that eventually leads to a restart / restore issue with old DNS information.
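To make the concern concrete, here is a hypothetical sketch of the two separate critical sections and the gap between them; the names are illustrative and this is not the actual patch:

```go
package example

import "sync"

// Endpoint is a stand-in used only to show the split critical sections.
type Endpoint struct {
	mutex    sync.RWMutex
	dnsRules map[string]string
}

func (e *Endpoint) syncHeaderfile(latest map[string]string, write func(map[string]string)) {
	// Critical section 1: update the DNS rules under the write lock.
	e.mutex.Lock()
	e.dnsRules = latest
	e.mutex.Unlock()

	// Gap: another goroutine may acquire e.mutex here, so the two sections
	// no longer observe a single consistent endpoint state.

	// Critical section 2: snapshot under the read lock only, letting other
	// readers proceed, then do the (potentially slow) write outside the lock.
	e.mutex.RLock()
	snapshot := e.dnsRules // reference copy, sufficient for illustration
	e.mutex.RUnlock()
	write(snapshot)
}
```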
OK that makes sense to me. I'll drop this commit, but keep it locally in case we may want to apply it again in the future. I'll transfer the commit msg over to the first patch.
FWIW,
> in the case of a bug, maybe we rarely hit a strange issue where stale DNS rules get written and that eventually leads to a restart / restore issue with old DNS information.
since we are writing DNS information on a trigger every 5 seconds, technically the DNS information has gone stale as soon as we write it and begin waiting. Yes it's possible for us to write stale information just due to the scheduler and the gap between read and write locks, but the impact is minimal (/ acceptable IMO) since it'll just get updated again in the near future.
One more minor locking issue to resolve.
@@ -493,6 +495,19 @@ func createEndpoint(owner regeneration.Owner, policyGetter policyRepoGetter, nam
	return ep
}

func (e *Endpoint) initDNSHistoryTrigger() {
	// Note: This can only fail if the trigger func is nil.
Incidentally, this is a pretty weird API; this feels like something we could catch at compile time rather than encoding runtime errors to catch it. Something to think about for future improvements.
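Purely as an illustration of the compile-time direction hinted at here (a hypothetical API, not something from this PR): making the callback a required constructor argument removes the "trigger func is nil" failure from the common construction path, rather than reporting it at runtime.

```go
package example

import "time"

// TriggerFunc and DNSHistoryTrigger are hypothetical; they only sketch the
// shape of an API where the callback cannot simply be left unset.
type TriggerFunc func(reasons []string)

type DNSHistoryTrigger struct {
	minInterval time.Duration
	fn          TriggerFunc
}

// NewDNSHistoryTrigger takes the callback as a required argument, so the
// common misuse (constructing the trigger without ever setting the callback)
// does not arise; an explicit nil is still possible but easy to reject here.
func NewDNSHistoryTrigger(minInterval time.Duration, fn TriggerFunc) DNSHistoryTrigger {
	return DNSHistoryTrigger{minInterval: minInterval, fn: fn}
}
```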
/test
/test (Edit: flakes hit)

/test-1.23-net-next (Edit: hit #19264)
Upon a successful DNS response, Cilium's DNS proxy code will sync the
DNS history state to the individual endpoint's header file. Previously,
this sync was done inside a trigger; however, the calling code,
(*Endpoint).SyncEndpointHeaderFile(), acquired a write-lock for no good
reason. This effectively negated the benefits of having the DNS history
sync behind a trigger of 5 seconds.
This is especially suboptimal because the header file sync is actually
causing Cilium to serialize processing the DNS request for a single
endpoint.
To illustrate the impact of the above a bit more concretely, if a single
endpoint does 10 DNS requests at the same time, acquiring the write-lock
causes the processing of those 10 requests to be done one at a time. For
the sake of posterity, this is not the case if 10 endpoints were to make
DNS requests in parallel.
This obviously has a performance impact, both in terms of being slow
CPU-wise and in terms of memory. Take, for example, a bursty DNS request
environment: it could cause an uptick in memory usage due to many
goroutines being created and blocked because of the serialized locking.
Now that the code is all executing behind a trigger, we can remove the
lock completely and initialize the trigger setup where the Endpoint
object is created (e.g. createEndpoint(), parseEndpoint()). Now the lock
is only taken every 5 seconds, when the trigger runs.
This should relieve the lock contention drastically. For context, in a
user's environment where the pprof was shared with us, there were around
440 goroutines with 203 of them stuck waiting inside
SyncEndpointHeaderFile().
We can also modify SyncEndpointHeaderFile() to no longer return an
error, because it's not possible for invoking the trigger to fail. If we
fail to initialize the trigger itself, then we log an error, but this is
essentially impossible because it can only fail if the trigger func is
nil (which we control).
Understanding the locking contention came from inspecting the pprof via
the following command and subsequent code inspection.
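```
go tool trace -http :8080 ./cilium ./pprof-trace
```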
Suggested-by: Michi Mutsuzaki <michi@isovalent.com>
Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>