Improve endpoint and DNS proxy lock contention during bursty DNS traffic #19347

Merged
merged 1 commit on Apr 11, 2022

Conversation

christarazi
Member

@christarazi christarazi commented Apr 7, 2022

Upon a successful DNS response, Cilium's DNS proxy code will sync the
DNS history state to the individual endpoint's header file. Previously,
this sync was done inside a trigger; however, the calling code,
(*Endpoint).SyncEndpointHeaderFile(), acquired a write-lock for no good
reason. This effectively negated the benefits of having the DNS history
sync behind a 5-second trigger.

This is especially suboptimal because the header file sync is actually
causing Cilium to serialize processing the DNS request for a single
endpoint.

To illustrate the impact of the above a bit more concretely, if a single
endpoint does 10 DNS requests at the same time, acquiring the write-lock
causes the processing of those 10 requests to be done one at a time. For
the sake of posterity, this is not the case if 10 endpoints were to make
DNS requests in parallel.

This has a performance impact both CPU-wise and memory-wise. In an
environment with bursty DNS traffic, for example, it can cause an uptick
in memory usage because many goroutines are created and then block due
to the serialized locking.

Now that the code all executes behind a trigger, we can remove the lock
completely and move the trigger setup to where the Endpoint object is
created (e.g. createEndpoint(), parseEndpoint()). The lock is now only
taken every 5 seconds, when the trigger runs.

This should relieve the lock contention drastically. For context, in a
user's environment where a pprof was shared with us, there were around
440 goroutines, with 203 of them stuck waiting inside
SyncEndpointHeaderFile().

We can also modify SyncEndpointHeaderFile() to no longer return an
error, because invoking the trigger cannot fail. If we fail to
initialize the trigger itself, we log an error, but this is essentially
impossible because initialization can only fail if the trigger func is
nil (which we control).
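To make the coalescing behaviour concrete, here is a minimal standalone sketch of the trigger pattern described above. It is an illustration only, not Cilium's actual pkg/trigger implementation, and all names in it are made up.

```go
// Minimal standalone sketch of a coalescing trigger: bursty calls to
// Trigger() collapse into at most one run of fn per minInterval, so callers
// never block on the endpoint lock themselves. Not the real Cilium code.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type trigger struct {
	mu          sync.Mutex
	scheduled   bool
	minInterval time.Duration
	fn          func()
}

// Trigger requests a run of fn and returns immediately. Calls arriving while
// a run is already scheduled are coalesced into that single run.
func (t *trigger) Trigger() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.scheduled {
		return
	}
	t.scheduled = true
	time.AfterFunc(t.minInterval, func() {
		t.mu.Lock()
		t.scheduled = false
		t.mu.Unlock()
		t.fn()
	})
}

func main() {
	var syncs int64
	t := &trigger{
		minInterval: 100 * time.Millisecond, // the DNS history sync uses 5s
		fn:          func() { atomic.AddInt64(&syncs, 1) },
	}
	// Ten concurrent "DNS responses" from one endpoint coalesce into a
	// single header file sync.
	for i := 0; i < 10; i++ {
		go t.Trigger()
	}
	time.Sleep(300 * time.Millisecond)
	fmt.Println("header file syncs:", atomic.LoadInt64(&syncs)) // expected: 1
}
```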

Understanding the lock contention came from inspecting the pprof trace
via the following command and subsequent code inspection.

go tool trace -http :8080 ./cilium ./pprof-trace
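For reference, the execution trace that go tool trace consumes is typically captured from a Go process exposing the standard net/http/pprof handlers; the generic sketch below shows that setup (it is not Cilium's actual pprof wiring).

```go
// Generic example of exposing the /debug/pprof/trace endpoint consumed by
// `go tool trace`; how Cilium itself exposes pprof is not shown here.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/trace
)

func main() {
	// A 10-second execution trace can then be fetched with, e.g.:
	//   curl -o pprof-trace 'http://localhost:6060/debug/pprof/trace?seconds=10'
	// and inspected with the `go tool trace` command shown above.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```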

Suggested-by: Michi Mutsuzaki michi@isovalent.com
Suggested-by: André Martins andre@cilium.io
Signed-off-by: Chris Tarazi chris@isovalent.com

@christarazi christarazi requested review from a team as code owners April 7, 2022 00:59
@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Apr 7, 2022
@christarazi christarazi requested a review from jrajahalme April 7, 2022 00:59
@maintainer-s-little-helper maintainer-s-little-helper bot added and then removed the dont-merge/needs-release-note-label (The author needs to describe the release impact of these changes.) label on Apr 7, 2022
@christarazi christarazi changed the title from "pr/christarazi/ep lock dns trigger" to "Improve endpoint and DNS proxy lock contention during bursty DNS traffic" on Apr 7, 2022
Member

@joestringer joestringer left a comment

Nice find on the caller of the trigger grabbing the lock. Just one minor comment there to see if we could get more reliable initialization to avoid a potential race condition.

I was less sure about the second patch, probably better to have a more detailed discussion in the comments below. Though it also highlighted one potential (lack of) locking issue from outside subsystems into the endpoint which we should also follow up on.

OnDNSPolicyUpdateLocked(rules restore.DNSRules)
OnDNSPolicyUpdateForce(rules restore.DNSRules)
Member

🤔 Not directly related to this PR, but how does the user of EndpointUpdater lock the Endpoint when they call this function?

The naming scheme behind the fooLocked() was an attempt to highlight that the appropriate locks must be held while calling functions like this, and this comes in useful sometimes while debugging stack traces from locking-related bugs, especially if the calls cross package boundaries like this.
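As a purely hypothetical sketch of that convention (these names are made up for the example and are not Cilium's actual APIs), the Locked suffix marks the variant that assumes the caller already holds the lock:

```go
// Hypothetical illustration of the *Locked naming convention; not real
// Cilium code.
package endpoint

import "sync"

// DNSRules stands in for the restored DNS rules type.
type DNSRules map[string][]string

type Endpoint struct {
	mu       sync.RWMutex
	dnsRules DNSRules
}

// setDNSRulesLocked must only be called with e.mu held for writing. The
// "Locked" suffix is what signals that obligation to callers, which is
// especially useful when the call crosses package boundaries.
func (e *Endpoint) setDNSRulesLocked(rules DNSRules) {
	e.dnsRules = rules
}

// SetDNSRules is the lock-taking wrapper used by callers that do not already
// hold the endpoint lock.
func (e *Endpoint) SetDNSRules(rules DNSRules) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.setDNSRulesLocked(rules)
}
```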

Member Author

Updated as mentioned in similar comment.

Member

I think we still need to review all usages of this function to ensure that callers are properly locking the endpoint when they update the DNS rules. This doesn't need to be done for this PR, but it could lead to strange behaviour in rare cases.

// sync the endpoint state to the header file.
e.UpdateDNSRulesIfNeeded()

if err := e.rlockAlive(); err != nil {
Member

I appreciate the detailed argument in the commit message, but I don't fully follow the logic for this second patch:

This is especially suboptimal in the context of the DNS proxy. Upon a
successful DNS response, the DNS proxy calls down to the endpoint to
sync the potentially new IPs from the DNS request to its header file via
writeHeaderfile() [1]. Therefore, while the DNS proxy is handling the
request, the header file synchronization is actually causing Cilium to
serialize processing the DNS request for a single endpoint. This is
especially apparent with the previous commit.

To illustrate the impact of the above a bit more concretely, if a single
endpoint does 10 DNS requests at the same time, acquiring the write-lock
causes the processing of those 10 requests to be done one at a time. For
the sake of posterity, this is not the case if 10 endpoints were to make
DNS requests in parallel.

After the first patch in this PR, that path no longer grabs the endpoint's write lock; it only triggers the update to disk, so multiple requests become a single call from the Trigger goroutine to handle a single write within a bounded period.

As far as I can tell, there are only two possible callpoints for writing the headerfile: either during Endpoint regeneration (runPreCompilationSteps), at which point we generate the headerfile for potential use during compilation (or at least just to get the latest version of all of the endpoint's state to disk), or alternatively from the TriggerFunc via SyncEndpointHeaderFile(). Both of these callpoints should be fairly tightly ratelimited already by other mechanisms in the daemon (for endpoint regeneration, probably just the relatively low number of events or, in the worst case, the build mutex limiting concurrent regenerations; for FQDN updates, the Trigger).

I believe that the original reasoning behind the writelock for the headerfile was more about the write to disk and avoiding multiple writes there rather than to do with the Endpoint datastructure itself. But by now, the Trigger should resolve that question in a better way. I note also that the callee for writing the headerfile purports to be tolerant of concurrent calls, as it first creates a temporary file with the new content, then atomically swaps the file into the correct location, so it's not incompatible with this approach.
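For reference, the temp-file-plus-rename pattern mentioned above looks roughly like the following generic sketch; this is not Cilium's actual writeHeaderfile(), just an illustration of why concurrent writers cannot tear the file.

```go
// Generic sketch of the "write temp file, then rename" approach; an
// illustration only, not Cilium's writeHeaderfile().
package headerfile

import (
	"os"
	"path/filepath"
)

// writeAtomically writes data to path such that readers only ever see either
// the old contents or the complete new contents, never a partial write.
// Concurrent writers cannot tear the file; the last rename simply wins.
func writeAtomically(path string, data []byte) error {
	// Create the temp file in the same directory so the final rename stays on
	// one filesystem, which is what makes rename(2) atomic.
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if we fail before renaming

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}
```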

The patch as-is still serializes the callers of writeHeaderfile(), since they still must grab the writelock to be able to synchronize the DNS Rules field. I agree that if writeHeaderfile() itself is expensive, then by reducing the size of the critical section, we could reduce lock contention during the expensive operation; though if the write to disk really is expensive, then we're now potentially exacerbating the issue (of spending more time writing to disk) by allowing yet more concurrent writes to disk.

Furthermore, e.buildMutex is already held for writing by both runPreCompilationSteps() [1] and syncEndpointHeaderfile(), so even if the critical sections for the read/write locks of e.mutex are changed here, at least the headerfile write execution paths will not be impacted when it comes to parallel execution. Perhaps if there are other read-only accesses going on in the background from other parts of the agent and attempting to read Endpoint fields, then the reduced time spent holding the e.mutex for writing could allow those other reads to complete quicker. Is that what you are concerned about? Do you have examples of those read operations to highlight the use case for this kind of improvement? It would also be helpful to understand real-world examples of how long we expect these writes to take, particularly with worst-case scenarios like many FQDNs in the DNS history, as that could further motivate a change like this.

Footnotes

  1. via call stack Endpoint.regenerate() -> Endpoint.regenerateBPF() -> Endpoint.runPreCompilationSteps()

Member Author

You're correct in that this patch doesn't fully make the header file sync read-only, as I missed that runPreCompilationSteps() also writes to the header file.

As far as I can tell, there are only two possible callpoints for writing the headerfile: either during Endpoint regeneration (runPreCompilationSteps), at which point we generate the headerfile for potential use during compilation (or at least just to get the latest version of all of the endpoint's state to disk), or alternatively from the TriggerFunc via SyncEndpointHeaderFile(). Both of these callpoints should be fairly tightly ratelimited already by other mechanisms in the daemon (for endpoint regeneration, probably just the relatively low number of events or, in the worst case, the build mutex limiting concurrent regenerations; for FQDN updates, the Trigger).

I believe that the original reasoning behind the writelock for the headerfile was more about the write to disk and avoiding multiple writes there rather than to do with the Endpoint datastructure itself. But by now, the Trigger should resolve that question in a better way. I note also that the callee for writing the headerfile purports to be tolerant of concurrent calls, as it first creates a temporary file with the new content, then atomically swaps the file into the correct location, so it's not incompatible with this approach.

👍

The patch as-is still serializes the callers of writeHeaderfile(), since they still must grab the writelock to be able to synchronize the DNS Rules field. I agree that if writeHeaderfile() itself is expensive, then by reducing the size of the critical section, we could reduce lock contention during the expensive operation; though if the write to disk really is expensive, then we're now potentially exacerbating the issue (of spending more time writing to disk) by allowing yet more concurrent writes to disk.

Technically, with this patch yes, updating the DNSRules field causes us to grab the write-lock, which probably makes the execution much more serial than I was expecting and described in this commit msg. I do think it's a valuable change to reduce the critical section. It's worth mentioning that yes we are writing to disk, but this write is practically not really to disk -- this write occurs in the Cilium state directory which is typically mounted as a tmpfs so it's pretty much writing to memory -- just a longer-winded way to do it.

Furthermore, e.buildMutex is already held for writing by both runPreCompilationSteps() [1] and syncEndpointHeaderfile(), so even if the critical sections for the read/write locks of e.mutex are changed here, at least the headerfile write execution paths will not be impacted when it comes to parallel execution. Perhaps if there are other read-only accesses going on in the background from other parts of the agent and attempting to read Endpoint fields, then the reduced time spent holding the e.mutex for writing could allow those other reads to complete quicker. Is that what you are concerned about? Do you have examples of those read operations to highlight the use case for this kind of improvement? It would also be helpful to understand real-world examples of how long we expect these writes to take, particularly with worst-case scenarios like many FQDNs in the DNS history, as that could further motivate a change like this.

In the end, yes, I'd say this commit effectively is concerned about the above, or at least tries to improve the above. I didn't do any measuring or benchmarks to back up my proposal, as it would've required reproducing the environment where there was extremely bursty DNS traffic from one endpoint. As mentioned in the commit msg, I viewed the Go blocking profile (from pprof-trace) and concluded that there was heavy lock contention around the endpoint mutex, so I presume that reducing the critical section with these changes will improve it. If we believe this is too risky of a change and needs concrete data to back it up, I'd be happy to take the time to think of how to benchmark it. It can likely be driven through a unit test, roughly as sketched below.
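A rough sketch of how such a benchmark could be driven from a Go test follows. The names here are made up for illustration and are not part of the Cilium test suite; the Sleep stands in for the cost of writing the header file.

```go
// Benchmark sketch: many goroutines simulate concurrent DNS responses for a
// single endpoint, each taking the endpoint write-lock around a simulated
// header file write. Hypothetical names; not actual Cilium tests.
package endpoint_test

import (
	"sync"
	"testing"
	"time"
)

type fakeEndpoint struct {
	mu sync.RWMutex
}

// syncHeaderfile simulates the pre-patch behaviour: every DNS response takes
// the endpoint write-lock around the (simulated) header file write.
func (e *fakeEndpoint) syncHeaderfile() {
	e.mu.Lock()
	defer e.mu.Unlock()
	time.Sleep(100 * time.Microsecond)
}

// Run with e.g.: go test -bench HeaderfileSync -cpu 1,4,16
// Per-op time growing with the -cpu value shows the serialization described
// in the commit message.
func BenchmarkHeaderfileSyncUnderContention(b *testing.B) {
	e := &fakeEndpoint{}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			e.syncHeaderfile()
		}
	})
}
```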

I'm going to update the commit msg text to reflect the above, thanks for drilling into it and not taking my words at face-value :).

Member

It's worth mentioning that yes we are writing to disk, but this write is practically not really to disk -- this write occurs in the Cilium state directory which is typically mounted as a tmpfs so it's pretty much writing to memory -- just a longer-winded way to do it.

💡 This never really occurred to me before, but given that /run (and hence also /var/run) are typically tmpfs, yes this would be the common case.

I'm ~OK with this, but this smells a bit like premature optimization to me. If we're typically running this in-memory write once every 5 seconds, and it's maybe a few kilobytes, then I'd expect the locking delay from this particular operation to be in the low microseconds and to only occasionally interrupt other operations. Sure, we could make it faster, but the code becomes more nuanced to deal with, and I'm not sure whether the evidence points towards this in particular being an issue.

As mentioned in the commit msg, I viewed the Go blocking profile (from pprof-trace) and concluded that there was heavy lock contention around the endpoint mutex

I suspect that the first patch will do a lot more for this than the second. But 🤷

Member Author

I suspect that the first patch will do a lot more for this than the second.

Agreed. I can see the argument for premature optimization. My thinking was to try to reduce as much lock contention as I could.

From the user's sysdump and pprof, out of ~440 goroutines, 203 were blocked inside SyncEndpointHeaderFile(). While the first patch will likely resolve all that, just considering that that many goroutines could be going through this function was enough impetus for me to try to go as far as possible.

Do you feel the risk with this optimization is that it could lead to regressions, or more that the code restructuring it involved wasn't necessarily worth it given the lack of evidence?

Member

I think that the risk is that we make the locking access more complicated, putting more burden on the next person who comes along to understand the implications of this change and to try to retain the same optimizations. And if we don't provide a strong enough argument or context for that next person to reason about this optimization, it's difficult for them to figure out whether any future improvements could cause a performance regression here.

The implications there can range from simply "is it important to ensure that the writelock is not held while writing the headerfile?", which I'm currently assuming is rare for a relatively short period (though could plausibly change in future, either intentionally or unintentionally) through to "is it correct for these two operations to be now in separate critical sections?": since we'll now release the writelock, then grab the readlock, without a direct transition. Other goroutines could technically grab a lock in between the two critical sections. This should be OK right now, as the writelock part is just trying to do a best-effort grab of the latest DNS rules state. But if it turns out that these two critical sections get split up for a long period because of another goroutine grabbing & holding the lock in between, then you could imagine that the readlock portion where we write the headerfile could end up writing somewhat old state. In general this should not occur, and all lock accesses should be relatively short, but in the case of a bug, maybe we rarely hit a strange issue where stale DNS rules get written and that eventually leads to a restart / restore issue with old DNS information.
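To make that concern concrete, a hypothetical sketch of the split critical sections (none of these names are the actual Cilium code) shows the gap where another goroutine could slip in:

```go
// Hypothetical sketch of split critical sections; not real Cilium code.
package endpoint

import "sync"

type endpointSketch struct {
	mu       sync.RWMutex
	dnsRules map[string][]string
}

func (e *endpointSketch) syncHeaderfile(latest map[string][]string) {
	// Phase 1: best-effort refresh of the DNS rules under the write-lock.
	e.mu.Lock()
	e.dnsRules = latest
	e.mu.Unlock()

	// Gap: e.mu is free here, so another goroutine may acquire it and mutate
	// endpoint state between the two critical sections.

	// Phase 2: serialize and write the header file under the read-lock only.
	e.mu.RLock()
	defer e.mu.RUnlock()
	writeHeaderfileSketch(e.dnsRules) // may observe state changed in the gap
}

// writeHeaderfileSketch stands in for the actual header file write.
func writeHeaderfileSketch(rules map[string][]string) {}
```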

Member Author

@christarazi christarazi Apr 7, 2022

OK that makes sense to me. I'll drop this commit, but keep it locally in case we may want to apply it again in the future. I'll transfer the commit msg over to the first patch.

FWIW,

in the case of a bug, maybe we rarely hit a strange issue where stale DNS rules get written and that eventually leads to a restart / restore issue with old DNS information.

since we are writing DNS information on a trigger every 5 seconds, technically the DNS information has gone stale as soon as we write it and begin waiting. Yes it's possible for us to write stale information just due to the scheduler and the gap between read and write locks, but the impact is minimal (/ acceptable IMO) since it'll just get updated again in the near future.

@christarazi christarazi force-pushed the pr/christarazi/ep-lock-dns-trigger branch from 2e6a439 to 1982f9c on April 7, 2022 17:51
@christarazi christarazi force-pushed the pr/christarazi/ep-lock-dns-trigger branch from 1982f9c to 099eb11 on April 7, 2022 17:52
@christarazi christarazi requested a review from joestringer April 7, 2022 17:52
@christarazi christarazi force-pushed the pr/christarazi/ep-lock-dns-trigger branch from 099eb11 to 210da8a on April 7, 2022 17:53
Member

@joestringer joestringer left a comment

One more minor locking issue to resolve.

@christarazi christarazi force-pushed the pr/christarazi/ep-lock-dns-trigger branch from 210da8a to b891e77 on April 7, 2022 18:23
@@ -493,6 +495,19 @@ func createEndpoint(owner regeneration.Owner, policyGetter policyRepoGetter, nam
return ep
}

func (e *Endpoint) initDNSHistoryTrigger() {
// Note: This can only fail if the trigger func is nil.
Member

Incidentally, this is a pretty weird API; this feels like something we could catch at compile time rather than encoding it as a runtime error. Something to think about for future improvements.

@christarazi
Member Author

/test

@christarazi christarazi force-pushed the pr/christarazi/ep-lock-dns-trigger branch from b891e77 to 3f9751a on April 7, 2022 19:16
@christarazi christarazi force-pushed the pr/christarazi/ep-lock-dns-trigger branch from 3f9751a to a438f4c on April 7, 2022 19:20
@christarazi
Member Author

christarazi commented Apr 7, 2022

@christarazi
Member Author

christarazi commented Apr 7, 2022

/test-1.23-net-next

Edit: hit #19264

Labels
area/daemon Impacts operation of the Cilium daemon. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
8 participants