
Conversation

nebril
Member

@nebril nebril commented Apr 27, 2022

This commit does the following:

  • Adds a configuration option for controlling the concurrency of the DNS
    proxy
  • Adds a configuration option for the semaphore (from above) timeout
  • Exposes an additional metric for the time taken to perform the policy
    check on a DNS request within the DNS proxy

The concurrency limit is implemented by introducing a semaphore into
the DNS proxy. By default, no such limit is imposed.
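
As an illustration of the mechanism only, here is a minimal sketch in
Go. It is not the code added by this PR; the `proxy` type, the
`newProxy` and `checkAllowed` functions, and the error value are
invented for the example, with `golang.org/x/sync/semaphore` standing
in for whatever semaphore the proxy actually uses.

```go
package dnsproxysketch

import (
	"errors"

	"golang.org/x/sync/semaphore"
)

// errTooManyConcurrentRequests is returned when the proxy is at its
// configured concurrency limit.
var errTooManyConcurrentRequests = errors.New("too many concurrent DNS requests")

// proxy is a stand-in for the DNS proxy; all names here are illustrative.
type proxy struct {
	// sem caps how many DNS requests are processed concurrently.
	// It is left nil when no limit is configured (the default).
	sem *semaphore.Weighted
}

// newProxy creates the sketch proxy with an optional concurrency limit.
func newProxy(concurrencyLimit int64) *proxy {
	p := &proxy{}
	if concurrencyLimit > 0 {
		p.sem = semaphore.NewWeighted(concurrencyLimit)
	}
	return p
}

// checkAllowed runs the per-request policy check, gated by the semaphore.
func (p *proxy) checkAllowed(policyCheck func() bool) (bool, error) {
	if p.sem != nil {
		if !p.sem.TryAcquire(1) {
			// At the limit: reject the request so the client retries later.
			return false, errTooManyConcurrentRequests
		}
		defer p.sem.Release(1)
	}
	return policyCheck(), nil
}
```

Passing a limit of 0 leaves the semaphore nil, which matches the
default of imposing no limit at all.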

Users are advised to take into account the rate of DNS requests[1] and
how many CPUs are available on each node in their cluster in order to
come up with an appropriate concurrency limit.
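
As a purely illustrative calculation (none of these numbers come from
this PR): by Little's law, the average number of in-flight DNS requests
is the request rate multiplied by the time one policy check takes, e.g.
1,000 requests/s × 2 ms ≈ 2 requests in flight on average. A limit
comfortably above that steady-state value (for instance a small
multiple of the node's CPU count) then mainly serves to cap bursts
rather than throttle normal traffic.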

In addition, we expose the semaphore grace period as a configurable
option. Choosing the "right" value for this timeout is a tradeoff that
we shouldn't make on the user's behalf.

The semaphore grace period protects against the situation where Cilium
deadlocks, or where a consistently high rate of DNS traffic leaves
Cilium unable to keep up.

See #19543 (comment) by
joe@cilium.io.

The user can take into account the rate at which they expect DNS
requests to flow into Cilium and how many of those requests should be
processed without retrying. If retrying isn't an issue, then keeping
the grace period at 0 (the default) will immediately free the goroutine
handling the DNS request if the semaphore acquire fails. Conversely, if
a backlog of "unproductive" goroutines is acceptable (and DNS request
retries are not), then setting a non-zero grace period is advisable, as
it gives the goroutines some time to acquire the semaphore. Goroutines
could still pile up if the grace period is too high and there's a
consistently high rate of DNS requests.
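
Continuing the sketch above (again a hedged illustration, not the exact
implementation; `acquire` is an invented name and the error value is
reused from the earlier sketch), the grace period only changes how a
semaphore slot is obtained: zero gives up immediately, non-zero waits
up to that long.

```go
package dnsproxysketch

import (
	"context"
	"time"

	"golang.org/x/sync/semaphore"
)

// acquire obtains a semaphore slot, honoring the configured grace period.
// A grace period of 0 fails immediately when the proxy is at its limit;
// a non-zero grace period waits up to that long for a slot to free up.
func acquire(sem *semaphore.Weighted, gracePeriod time.Duration) error {
	if gracePeriod == 0 {
		if !sem.TryAcquire(1) {
			return errTooManyConcurrentRequests
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), gracePeriod)
	defer cancel()
	// Blocks until a slot frees up or the grace period elapses; a timeout
	// here is what surfaces as a dropped DNS request the client must retry.
	return sem.Acquire(ctx, 1)
}
```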

It's worth noting that blindly increasing the concurrency limit will
not improve performance linearly. Performance might actually degrade
due to internal downstream lock contention (as seen in the recent
commits that move Endpoint-related functions to read-locks).

Ultimately, it becomes a tradeoff between a high number of semaphore
timeouts (dropped DNS requests that must be retried) and a high number
of (unproductive) goroutines, which can consume system resources.

[1]: The metric to monitor is

cilium_policy_l7_total{rule="received"}
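
For example, the per-node DNS request rate can be estimated in
Prometheus with an expression along the lines of
`rate(cilium_policy_l7_total{rule="received"}[5m])`; the exact labels
available may vary by Cilium version and configuration.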

Co-authored-by: Chris Tarazi chris@isovalent.com
Signed-off-by: Maciej Kwiek maciej@isovalent.com
Signed-off-by: Chris Tarazi chris@isovalent.com

@nebril nebril added kind/performance There is a performance impact of this. needs-backport/1.10 labels Apr 27, 2022
@nebril nebril requested a review from a team April 27, 2022 13:34
@nebril nebril requested a review from a team as a code owner April 27, 2022 13:34
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 27, 2022
@nebril nebril requested review from jibi and michi-covalent April 27, 2022 13:34
@nebril
Member Author

nebril commented Apr 27, 2022

/test

@ciliumbot

Build finished.

@nebril nebril force-pushed the pr/nebril/dnproxy-concurrency-master branch from 8dfc0be to d400e05 Compare April 28, 2022 09:53
@nebril
Member Author

nebril commented Apr 28, 2022

/test

@joestringer joestringer added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Apr 29, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 29, 2022
Member

@joestringer joestringer left a comment


I got confused by the multiple versions of this PR that are out, so I put my feedback on the wrong version. See #19543 (review) for more details.

Let's get this into the master tree first in the form that is acceptable upstream, then arrange the backports to line up the same.

@christarazi christarazi force-pushed the pr/nebril/dnproxy-concurrency-master branch 3 times, most recently from 971e76b to 29e451b Compare May 3, 2022 22:21
@christarazi christarazi force-pushed the pr/nebril/dnproxy-concurrency-master branch from 29e451b to 1f81231 Compare May 3, 2022 22:50
@christarazi
Member

/test

@christarazi
Member

christarazi commented May 4, 2022

Noting that all of CI has passed, except for the runtime test, which hit a legitimate failure. See the diff for the fix.

@christarazi christarazi force-pushed the pr/nebril/dnproxy-concurrency-master branch from 1f81231 to 8551838 Compare May 4, 2022 18:23
@christarazi
Member

/test-runtime

@christarazi christarazi force-pushed the pr/nebril/dnproxy-concurrency-master branch from 8551838 to 48fbece Compare May 5, 2022 20:57
@christarazi christarazi requested a review from joestringer May 10, 2022 06:24
@christarazi christarazi force-pushed the pr/nebril/dnproxy-concurrency-master branch from c1bc016 to 75c328c Compare May 10, 2022 16:54
@christarazi
Member

/test

@christarazi christarazi force-pushed the pr/nebril/dnproxy-concurrency-master branch from 75c328c to d865d64 Compare May 10, 2022 17:30
@christarazi
Member

/test

@christarazi christarazi force-pushed the pr/nebril/dnproxy-concurrency-master branch from d865d64 to f6a33ae Compare May 10, 2022 18:54
@christarazi
Member

/test

@christarazi
Member

CI passed except for a legitimate error in the runtime test.

@christarazi christarazi force-pushed the pr/nebril/dnproxy-concurrency-master branch from f6a33ae to 866d91c Compare May 10, 2022 21:56
@christarazi
Copy link
Member

/test-runtime

Labels

  • backport-done/1.11 The backport for Cilium 1.11.x for this PR is done.
  • kind/performance There is a performance impact of this.
  • release-note/minor This PR changes functionality that users may find relevant to operating Cilium.