Add concurrency limiting for DNS message processing #19592
Conversation
/test
Build finished.
8dfc0be to d400e05 Compare
/test
I got confused by the multiple versions of this PR that are out, so I put my feedback on the wrong version. See #19543 (review) for more details.
Let's get this into the master tree first in the form that is acceptable upstream, then arrange the backports to line up the same.
971e76b to 29e451b Compare
29e451b to 1f81231 Compare
/test
Noting that all of CI has passed, except for the runtime test, which failed legitimately. See the diff for the fix.
1f81231 to 8551838 Compare
/test-runtime
8551838 to 48fbece Compare
c1bc016 to 75c328c Compare
/test
75c328c to d865d64 Compare
/test
d865d64 to f6a33ae Compare
/test
This commit does the following:

* Adds a configuration option for controlling the concurrency of the DNS proxy
* Adds a configuration option for the semaphore (from above) timeout
* Exposes an additional metric for the time taken to perform the policy check on a DNS request within the DNS proxy

The concurrency limitation is done by introducing a semaphore to the DNS proxy. By default, no such limit is imposed. Users are advised to take into account the number of DNS requests[1] and how many CPUs are on each node in their cluster in order to come up with an appropriate concurrency limit.

In addition, we expose the semaphore grace period as a configurable option. Choosing the "right" value for this timeout is a tradeoff that we shouldn't assume on the user's behalf.

The semaphore grace period prevents the situation where Cilium deadlocks, or where a consistently high rate of DNS traffic leaves Cilium unable to keep up. See cilium#19543 (comment) by <joe@cilium.io>.

The user can take into account the rate at which they expect DNS requests to be flowing into Cilium and how many of those requests should be processed without retrying. If retrying isn't an issue, then keeping the grace period at 0 (default) will immediately free the goroutine handling the DNS request if the semaphore acquire fails. Conversely, if a backlog of "unproductive" goroutines is acceptable (and DNS request retries are not), then setting the grace period is advisable. This gives the goroutines some time to acquire the semaphore. Goroutines could pile up if the grace period is too high and there's a consistently high rate of DNS requests.

It's worth noting that blindly increasing the concurrency limit will not linearly improve performance. It might actually degrade it instead, due to internal downstream lock contention (as seen in the recent commits moving Endpoint-related functions to read-locks).

Ultimately, it becomes a tradeoff between a high number of semaphore timeouts (dropped DNS requests that must be retried) and a high number of (unproductive) goroutines, which can consume system resources.

[1]: The metric to monitor is

```
cilium_policy_l7_total{rule="received"}
```

Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
CI passed except for a legitimate error in the runtime test.
f6a33ae to 866d91c Compare
/test-runtime