
Conversation


@joamaki joamaki commented Jun 7, 2022

PRs skipped due to conflicts:

Once this PR is merged, you can update the PR labels via:

$ for pr in 19347 19570 19423 19383 19857 19987 19933 19925 19959 19999 20011; do contrib/backporting/set-labels.py $pr done 1.10; done

or with

$ make add-label BRANCH=v1.10 ISSUES=19347,19570,19423,19383,19857,19987,19933,19925,19959,19999,20011

christarazi and others added 11 commits June 7, 2022 11:51
[ upstream commit 507332a ]

Upon a successful DNS response, Cilium's DNS proxy code will sync the
DNS history state to the individual endpoint's header file. Previously,
this sync was done inside a trigger; however, the calling code,
(*Endpoint).SyncEndpointHeaderFile(), acquired a write-lock for no good
reason. This effectively negated the benefits of having the DNS history
sync behind a 5-second trigger.

This is especially suboptimal because the header file sync effectively
causes Cilium to serialize the processing of DNS requests for a single
endpoint.

To illustrate the impact of the above a bit more concretely, if a single
endpoint does 10 DNS requests at the same time, acquiring the write-lock
causes the processing of those 10 requests to be done one at a time. For
the sake of posterity, this is not the case if 10 endpoints were to make
DNS requests in parallel.

This has a performance impact both CPU-wise and memory-wise. In a
bursty DNS request environment, for example, memory usage can spike
because many goroutines are created and then block on the serialized
lock.

Now that the code all executes behind a trigger, we can remove the
lock completely and initialize the trigger where the Endpoint
object is created (e.g. createEndpoint(), parseEndpoint()). The lock
is now only taken every 5 seconds, when the trigger runs.

This should relieve the lock contention drastically. For context, in a
user's environment where the pprof was shared with us, there were around
440 goroutines with 203 of them stuck waiting inside
SyncEndpointHeaderFile().

We can also modify SyncEndpointHeaderFile() to no longer return an
error, because it's not possible for invoking the trigger to fail. If we
fail to initialize the trigger itself, then we log an error, but this is
essentially impossible because it can only fail if the trigger func is
nil (which we control).

The lock contention was identified by inspecting the pprof trace with
the following command, followed by code inspection.

```
go tool trace -http :8080 ./cilium ./pprof-trace
```
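For illustration, here is a minimal sketch of the coalescing idea in plain Go (not Cilium's actual pkg/trigger implementation; the names and the 5-second interval are only assumptions based on the description above): callers merely mark a sync as pending, and the lock-holding sync function runs at most once per interval.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// syncTrigger coalesces many Trigger() calls into at most one sync per
// minInterval, so callers never block on the endpoint's write-lock.
type syncTrigger struct {
	mu          sync.Mutex
	pending     bool
	minInterval time.Duration
	syncFn      func()
}

func newSyncTrigger(minInterval time.Duration, syncFn func()) *syncTrigger {
	return &syncTrigger{minInterval: minInterval, syncFn: syncFn}
}

// Trigger marks a sync as pending and returns immediately.
func (t *syncTrigger) Trigger() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.pending {
		return // a sync is already scheduled; coalesce this request
	}
	t.pending = true
	time.AfterFunc(t.minInterval, func() {
		t.mu.Lock()
		t.pending = false
		t.mu.Unlock()
		t.syncFn() // the endpoint write-lock would be taken only here
	})
}

func main() {
	trig := newSyncTrigger(5*time.Second, func() {
		fmt.Println("syncing endpoint header file")
	})
	// Ten concurrent DNS responses no longer serialize on the lock:
	for i := 0; i < 10; i++ {
		go trig.Trigger()
	}
	time.Sleep(6 * time.Second) // prints "syncing endpoint header file" once
}
```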

Suggested-by: Michi Mutsuzaki <michi@isovalent.com>
Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 4c2c244 ]

Use strings.Builder instead of fmt.Sprintf() and preallocate the size of
the string so that Go doesn't need to over-allocate if the string ends
up longer than what the buffer growth algorithm predicts.
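As a rough sketch of the technique (the type and format string below are illustrative, not the exact pkg/policy/api code), the Sprintf-based version can be replaced by a preallocated strings.Builder:

```go
package main

import (
	"fmt"
	"strings"
)

// FQDNSelector mirrors the shape of the selector being stringified;
// the output format here is illustrative.
type FQDNSelector struct {
	MatchName    string
	MatchPattern string
}

// Sprintf-based version: fmt allocates intermediate values.
func (s FQDNSelector) StringSprintf() string {
	return fmt.Sprintf("MatchName: %s, MatchPattern: %s", s.MatchName, s.MatchPattern)
}

// Builder-based version: one buffer, grown once to the final size.
func (s FQDNSelector) StringBuilder() string {
	const fixed = "MatchName: , MatchPattern: "
	var sb strings.Builder
	sb.Grow(len(fixed) + len(s.MatchName) + len(s.MatchPattern))
	sb.WriteString("MatchName: ")
	sb.WriteString(s.MatchName)
	sb.WriteString(", MatchPattern: ")
	sb.WriteString(s.MatchPattern)
	return sb.String()
}

func main() {
	s := FQDNSelector{MatchName: "cilium.io", MatchPattern: "*.cilium.io"}
	fmt.Println(s.StringSprintf())
	fmt.Println(s.StringBuilder())
}
```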

Results:

```
$ go test -v -run '^$' -bench 'BenchmarkFQDNSelectorString' -benchmem ./pkg/policy/api > old.txt
$ go test -v -run '^$' -bench 'BenchmarkFQDNSelectorString' -benchmem ./pkg/policy/api > new.txt
$ benchcmp old.txt new.txt
benchmark                         old ns/op     new ns/op     delta
BenchmarkFQDNSelectorString-8     690           180           -73.97%

benchmark                         old allocs     new allocs     delta
BenchmarkFQDNSelectorString-8     9              4              -55.56%

benchmark                         old bytes     new bytes     delta
BenchmarkFQDNSelectorString-8     288           208           -27.78%
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 351c5d8 ]

SortedList() and FormatForKVStore() can be very hot code in environments
where there's constant policy churn, especially CIDR policies where
there can be a large number of CIDR labels.

This commit adds benchmarks for later commits to use as a baseline.
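A baseline benchmark of this shape might look roughly as follows (a sketch assuming the pkg/labels NewLabelsFromModel helper and the Labels.SortedList method; the label values are made up):

```go
package labels

import "testing"

// Run with:
//   go test -v -run '^$' -bench 'BenchmarkLabels_SortedList' -benchmem ./pkg/labels
func BenchmarkLabels_SortedList(b *testing.B) {
	lbls := NewLabelsFromModel([]string{
		"k8s:app=foo", "k8s:env=prod", "cidr:10.0.0.0/8",
	})
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = lbls.SortedList()
	}
}
```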

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 0790e07 ]

FormatForKVStore() previously returned a string even though every
caller converted the return value to a byte slice. Returning a byte
slice instead allows us to eliminate string concatenation entirely and
use the bytes.Buffer directly.

Building on the above, given that SortedList() returns a byte slice and
calls FormatForKVStore() for its output, we can optimize it with the
same technique to eliminate string concatenation.
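A simplified sketch of the technique (the Label type and separators below are illustrative, not the actual pkg/labels format): the formatter writes into a shared bytes.Buffer and the caller takes the bytes directly, with no string concatenation or string-to-byte-slice conversion.

```go
package main

import (
	"bytes"
	"fmt"
)

// Label is a simplified stand-in for the real pkg/labels type.
type Label struct {
	Source, Key, Value string
}

// formatForKVStore appends the label directly into the caller's buffer,
// avoiding both string concatenation and a string->[]byte conversion.
func (l Label) formatForKVStore(buf *bytes.Buffer) {
	buf.WriteString(l.Source)
	buf.WriteByte(':')
	buf.WriteString(l.Key)
	buf.WriteByte('=')
	buf.WriteString(l.Value)
	buf.WriteByte(';')
}

// sortedList reuses one buffer for all labels and returns its bytes.
func sortedList(lbls []Label) []byte {
	var b bytes.Buffer
	for _, l := range lbls {
		l.formatForKVStore(&b)
	}
	return b.Bytes()
}

func main() {
	out := sortedList([]Label{{"k8s", "app", "foo"}, {"k8s", "env", "prod"}})
	fmt.Printf("%s\n", out) // k8s:app=foo;k8s:env=prod;
}
```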

Here are the benchmark comparisons:

```
$ go test -v -run '^$' -bench 'BenchmarkLabels_SortedList|BenchmarkLabel_FormatForKVStore' -benchmem ./pkg/labels > old.txt
$ go test -v -run '^$' -bench 'BenchmarkLabels_SortedList|BenchmarkLabel_FormatForKVStore' -benchmem ./pkg/labels > new.txt
$ benchcmp old.txt new.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                             old ns/op     new ns/op     delta
BenchmarkLabels_SortedList-8          2612          1120          -57.12%
BenchmarkLabel_FormatForKVStore-8     262           54.5          -79.18%

benchmark                             old allocs     new allocs     delta
BenchmarkLabels_SortedList-8          35             13             -62.86%
BenchmarkLabel_FormatForKVStore-8     4              1              -75.00%

benchmark                             old bytes     new bytes     delta
BenchmarkLabels_SortedList-8          1112          664           -40.29%
BenchmarkLabel_FormatForKVStore-8     96            48            -50.00%
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 5fa7ae2 ]

Advanced users can configure the LRU size for the cache holding the
compiled regexes of FQDN match{Pattern,Name}. This is useful
if users are experiencing high memory usage spikes with many FQDN
policies that have repeated matchPattern or matchName across many
different policies.
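A minimal sketch of such a bounded regex cache, assuming the hashicorp/golang-lru package (Cilium's actual wiring differs; the size and pattern below are illustrative):

```go
package main

import (
	"fmt"
	"regexp"

	lru "github.com/hashicorp/golang-lru"
)

// regexCache deduplicates compiled regexes for repeated matchPattern /
// matchName values, bounded by an LRU of configurable size.
type regexCache struct {
	cache *lru.Cache
}

func newRegexCache(size int) (*regexCache, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &regexCache{cache: c}, nil
}

func (r *regexCache) compile(pattern string) (*regexp.Regexp, error) {
	if v, ok := r.cache.Get(pattern); ok {
		return v.(*regexp.Regexp), nil // reuse the shared compiled regex
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	r.cache.Add(pattern, re)
	return re, nil
}

func main() {
	c, _ := newRegexCache(1024)
	re, _ := c.compile(`^(.*[.])?cilium[.]io[.]$`)
	fmt.Println(re.MatchString("www.cilium.io.")) // true
}
```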

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 38c0036 ]

When comparing the efficiency of increasing the LRU size from 128 to 1024
with ~22k CNPs, we see the following results:

```
# LRU size 128.
$ go test -tags privileged_tests -v -run '^$' -bench Benchmark_perEPAllow_setPortRulesForID_large -benchmem -benchtime 1x -memprofile memprofile.out ./pkg/fqdn/dnsproxy > old.txt

# LRU size 1024.
$ go test -tags privileged_tests -v -run '^$' -bench Benchmark_perEPAllow_setPortRulesForID_large -benchmem -benchtime 1x -memprofile memprofile.out ./pkg/fqdn/dnsproxy > new.txt

$ benchcmp old.txt new.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                                          old ns/op      new ns/op      delta
Benchmark_perEPAllow_setPortRulesForID_large-8     3954101340     3010934555     -23.85%

benchmark                                          old allocs     new allocs     delta
Benchmark_perEPAllow_setPortRulesForID_large-8     26480632       24167742       -8.73%

benchmark                                          old bytes      new bytes      delta
Benchmark_perEPAllow_setPortRulesForID_large-8     2899811832     1824062992     -37.10%
```

Here's the raw test run with LRU size at 128:

```
Before (N=1)
Alloc = 31 MiB  HeapInuse = 45 MiB      Sys = 1260 MiB  NumGC = 15
After (N=1)
Alloc = 445 MiB HeapInuse = 459 MiB     Sys = 1260 MiB  NumGC = 40
```

Here's the raw test run with LRU size at 1024:

```
Before (N=1)
Alloc = 31 MiB  HeapInuse = 48 MiB      Sys = 1177 MiB  NumGC = 17
After (N=1)
Alloc = 78 MiB  HeapInuse = 93 MiB      Sys = 1177 MiB  NumGC = 53
```

We can see that it's saving ~300MB.

Furthermore, if we compare the memprofiles from the benchmark run via

```
go tool pprof -http :8080 -diff_base memprofile.out memprofile.1024.out
```

we see an ~800MB reduction in the regex compilation.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit ce9583d ]

Following the previous commit's benchmark results, let's update the
default LRU size to 1024, given that it only results in a few tens of
MBs of increase when the cache nears full.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 95362de ]

This commit adds new convenience functions to get a count of
 * how many entries (IPs) are inside the DNS cache
 * and how many FQDNs are inside the DNS cache
They will be used by upcoming commits to expose these values as metrics.
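A rough sketch of what such counting helpers could look like (the types and names are illustrative, not the actual pkg/fqdn API):

```go
package main

import "fmt"

// dnsCache holds a forward map of FQDN -> set of IPs (simplified).
type dnsCache struct {
	forward map[string]map[string]struct{}
}

// countFQDNs returns how many distinct names are in the cache.
func (c *dnsCache) countFQDNs() int {
	return len(c.forward)
}

// countIPs returns how many IP entries are in the cache across all names.
func (c *dnsCache) countIPs() int {
	n := 0
	for _, ips := range c.forward {
		n += len(ips)
	}
	return n
}

func main() {
	c := &dnsCache{forward: map[string]map[string]struct{}{
		"cilium.io.":  {"104.198.14.52": {}},
		"github.com.": {"140.82.121.3": {}, "140.82.121.4": {}},
	}}
	fmt.Println(c.countFQDNs(), c.countIPs()) // 2 3
}
```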

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit d9ae1c4 ]

This commit contains no functional changes and is only cosmetic to ease
future commits when adding new FQDN-related metrics.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 47870fe ]

This commit exposes new metrics that show the number of active names and
IPs in the DNS cache, and the number of alive FQDN connections that have
expired (a.k.a. zombies), per endpoint. This is useful for tracking the
endpoint's DNS cache and DNS zombie cache sizes over time.

Note that these metrics are only updated during the FQDN GC which is
currently invoked every minute.
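For illustration, gauges of this kind could be defined with the Prometheus client library roughly as follows (the metric names, labels, and values below are assumptions, not Cilium's actual metric definitions):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	fqdnActiveNames = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cilium",
		Name:      "fqdn_active_names",
		Help:      "Number of active DNS names in the endpoint's DNS cache",
	}, []string{"endpoint"})

	fqdnActiveIPs = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cilium",
		Name:      "fqdn_active_ips",
		Help:      "Number of active IPs in the endpoint's DNS cache",
	}, []string{"endpoint"})
)

func main() {
	prometheus.MustRegister(fqdnActiveNames, fqdnActiveIPs)
	// Updated once per FQDN GC run (roughly every minute), not per DNS request.
	fqdnActiveNames.WithLabelValues("1234").Set(42)
	fqdnActiveIPs.WithLabelValues("1234").Set(97)
}
```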

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 044681a ]

This commit bumps the Prometheus client library to the latest version
(v1.12.2), which has better support for the Go collector across
different Go versions and fixes a NaN value issue.

Fixes: cilium#19985

Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki requested review from a team as code owners June 7, 2022 12:25
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.10 kind/backports This PR provides functionality previously merged into master. labels Jun 7, 2022
@sayboras sayboras left a comment

Small comment on GHA, the rest looks good for my changes.

@@ -121,6 +121,9 @@ replace (

go.universe.tf/metallb => github.com/cilium/metallb v0.1.1-0.20210607221240-b4c60b959dd7

// Make sure to use grpc v1.29.1 as grpc/naming package was removed in newer versions.
google.golang.org/grpc => google.golang.org/grpc v1.29.1
Contributor Author

This was added due to:

```
go: finding module for package google.golang.org/grpc/naming
github.com/cilium/cilium/pkg/kvstore imports
	go.etcd.io/etcd/clientv3 tested by
	go.etcd.io/etcd/clientv3.test imports
	go.etcd.io/etcd/integration imports
	go.etcd.io/etcd/proxy/grpcproxy imports
	google.golang.org/grpc/naming: module google.golang.org/grpc@latest found (v1.47.0), but does not contain package google.golang.org/grpc/naming
```

Perhaps better would be to bump etcd here as well? Thoughts?

Member

Probably fine to keep the grpc lib pinned here. This is the v1.10 branch, where we don't typically expect further vendor changes moving forward.

rolinh and others added 5 commits June 7, 2022 14:34
[ upstream commit 7c86fe7 ]

When the service port for the peer service is not specified, it is
automatically assigned port 80, or port 443 when TLS is enabled. Using
these ports makes it easy to understand whether TLS is enabled for the
service or not. Moreover, it makes the behavior consistent with the
Hubble Relay service.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit f439177 ]

This commit fixes a bug where the keys of the forward map inside the DNS
cache were never removed, causing the map to grow forever. By contrast,
the reverse map keys were being deleted.

For both the forward and reverse maps (which are both maps whose values
are another map), the inner map keys were being deleted.

In other words, the delete on the outer map key was missing for the
forward map.
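A minimal sketch of the bug shape (illustrative types, not the actual pkg/fqdn cache structures): deleting only the inner key leaks the outer key, so the fix also drops the outer key once its inner map becomes empty.

```go
package main

import "fmt"

func main() {
	// forward: name -> IP -> presence (values simplified to struct{}).
	forward := map[string]map[string]struct{}{
		"cilium.io.": {"104.198.14.52": {}},
	}

	name, ip := "cilium.io.", "104.198.14.52"

	// Buggy cleanup: only the inner key is removed; the outer key for the
	// name stays forever, so the map grows with every unique FQDN.
	delete(forward[name], ip)

	// Fix: also drop the outer key once its inner map is empty.
	if len(forward[name]) == 0 {
		delete(forward, name)
	}

	fmt.Println(len(forward)) // 0 with the fix; 1 with the bug alone
}
```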

In addition to fixing the bug, this commit expands the unit test
coverage to assert after any deletes (entries expiring or GC) that the
forward and reverse maps contain what we expect.

In particular, in an environment with many unique DNS lookups (unique
FQDNs), this forward map could grow quite large over time, especially
for a long-lived workload (endpoint). This commit fixes this
memory-leak-like bug.

Fixes: cf387ce ("fqdn: Introduce TTL-aware cache for DNS retention")
Fixes: f6ce522 ("FQDN: Added garbage collector functions.")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 3500290 ]

This commit adds ownerReferences to CiliumNodes created by
CiliumExternalWorkload, so that we don't unintentionally GC these
CiliumNodes as invalid.
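For illustration, an ownerReference of this kind could be built roughly as follows (a sketch assuming the Kubernetes metav1 API; the helper name and field values are made up):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// ownerRefForCEW builds an ownerReference pointing at the owning
// CiliumExternalWorkload, so the created CiliumNode is no longer seen
// as an orphan by the garbage collector.
func ownerRefForCEW(name string, uid types.UID) metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion: "cilium.io/v2",
		Kind:       "CiliumExternalWorkload",
		Name:       name,
		UID:        uid,
	}
}

func main() {
	ref := ownerRefForCEW("my-external-workload", types.UID("1234-abcd"))
	// The reference would be appended to the CiliumNode's
	// ObjectMeta.OwnerReferences before the object is created.
	fmt.Printf("%+v\n", ref)
}
```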

Thanks to @nathanejohnson for reporting this issue.

Fixes: cilium#19907
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit f9863ec ]

Signed-off-by: Vlad Ungureanu <ungureanuvladvictor@gmail.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit aa7572b ]

Signed-off-by: Vlad Ungureanu <ungureanuvladvictor@gmail.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit c6af580 ]

This commit adds the `-o json` output to `cilium node list` and
`cilium-health status`, as the text version of both does not contain all
details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki force-pushed the pr/v1.10-backport-2022-06-07 branch from 82f8351 to 3b1ebc8 on June 7, 2022 12:35
@sayboras sayboras left a comment

Thanks and looks good for my changes 🎖️

@joamaki joamaki commented Jun 7, 2022

/test-backport-1.10

Job 'Cilium-PR-K8s-GKE' failed:


Test Name: K8sDatapathConfig Host firewall With VXLAN

Failure Output: FAIL: Unable to retrieve DNS pods to scale down, command 'kubectl get deploy -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].status.replicas}'': Exitcode: -1

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@gandro gandro left a comment

My commit looks good

@nbusseneau nbusseneau left a comment

github-sec review is no longer required following the drop of the commit editing the workflow files; approving just to get rid of the review request.

@christarazi christarazi left a comment

Thanks, my commits look good!

@joamaki joamaki merged commit b1e007a into cilium:v1.10 Jun 8, 2022