@jibi jibi commented Sep 29, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 17382 17288 17329 17359 17378 16264 17230 17424 17411 17417 17439 17432 16585 17477 17418; do contrib/backporting/set-labels.py $pr done 1.10; done

joestringer and others added 12 commits September 29, 2021 11:57
[ upstream commit 9e740b1 ]

The section that this guide refers to is now its own dedicated page,
and users can use any environment to test it out. Fix the redirect.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 98a995c ]

Use "sort -V" (versions) rather than "sort -n" (numeric) so that the
docs list the minor versions in chronological order.
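
To illustrate the difference, here is a minimal Python sketch that approximates the two orderings. The function names and the sample version list are assumptions made for illustration; the actual docs script uses the `sort` utility, not this code.

```python
def numeric_key(version: str) -> float:
    # Approximates "sort -n": the whole string is read as one decimal number,
    # so "1.10" becomes 1.1 and sorts before "1.8".
    return float(version)


def version_key(version: str) -> tuple:
    # Approximates "sort -V": each dot-separated component is compared
    # as an integer, so 10 sorts after 9.
    return tuple(int(part) for part in version.split("."))


# Hypothetical list of minor versions, for illustration only.
minor_versions = ["1.8", "1.9", "1.10", "1.11"]

by_number = sorted(minor_versions, key=numeric_key)
by_version = sorted(minor_versions, key=version_key)

print(by_number)
print(by_version)
```

With numeric ordering, 1.10 and 1.11 land before 1.8, which is the misordering the commit fixes.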

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 71a65cb ]

We don't need to implement this logic for two reasons:

1) We rely on CiliumNode resources to be deleted / cleaned up by
   attaching the corresponding K8s Node as an `ownerReference` in the
   CiliumNode.
2) It is redundant to delete the CiliumNode in response to an
   event...of the CiliumNode deletion itself.

In very rare cases, this logic can actually delete a newly created
CiliumNode by accident (see example below). Instead, keep all deletion
logic besides the actual K8s API calls (DELETE) and perform a Get() to
ensure that it's been deleted. Otherwise, log to the user that the
resource may still exist.

Example:
Say an existing node was deleted and then recreated in
quick succession with the same name. When the node is recreated, the
agent will be scheduled on it. During bootstrap it'll create a
corresponding CiliumNode resource. Given that only one Operator is
operational at any time in a cluster, it is already running on another
node in the cluster. The node-delete event will first delete the K8s
node and then trigger a CN-delete via reason 1 from above. It is
possible for the CN-delete event to be delayed such that it is received
after the node-create event (the recreate). When the CN-delete event is
received by the already-running Operator, the CiliumNode watcher logic
will then trigger (erroneously) another CN-delete, thereby deleting the
CiliumNode resource while the K8s node is still alive.
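
The race described above can be replayed with a small Python simulation. This is only a sketch under stated assumptions: `FakeCluster`, the handler names, and the node name `node-a` are hypothetical stand-ins, not the Operator's actual Go code or the Kubernetes client API.

```python
class FakeCluster:
    """Hypothetical in-memory stand-in for the K8s API; not Cilium code."""

    def __init__(self):
        self.cilium_nodes = set()
        self.log = []

    def create_cilium_node(self, name):
        self.cilium_nodes.add(name)

    def delete_cilium_node(self, name):
        self.cilium_nodes.discard(name)


def old_on_cn_delete(cluster, name):
    # Old behaviour: react to a CN-delete event by issuing another DELETE.
    cluster.delete_cilium_node(name)


def new_on_cn_delete(cluster, name):
    # New behaviour: no DELETE call; only check (a Get) and log if the
    # resource is still present.
    if name in cluster.cilium_nodes:
        cluster.log.append(f"CiliumNode {name} may still exist")


def replay(handler):
    cluster = FakeCluster()
    cluster.create_cilium_node("node-a")  # agent bootstrap on the original node
    cluster.delete_cilium_node("node-a")  # node deleted; ownerReference cleanup removes the CN
    cluster.create_cilium_node("node-a")  # node recreated; agent creates a fresh CN
    handler(cluster, "node-a")            # the delayed CN-delete event finally arrives
    return cluster


old = replay(old_on_cn_delete)
new = replay(new_on_cn_delete)
print("node-a" in old.cilium_nodes, "node-a" in new.cilium_nodes)
```

With the old handler the freshly created CiliumNode is gone at the end of the replay; with the new handler it survives and a message is logged instead.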

Fixes: 6d44f4c ("operator: sync cilium nodes to kvstore instead of k8s nodes")
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
…mNode

[ upstream commit b0c3393 ]

It is impossible to set the OwnerReference if we fail to fetch the
corresponding Kubernetes Node and the existing CiliumNode resource
doesn't already have it set. We can rely on the OwnerReference being set
because this logic was added in v1.6, which is a sufficiently early
version of Cilium. [1]

The reason for doing this is to ensure that the OwnerReference can
always be set. If we cannot, this should be treated as an error and we
shouldn't proceed. Cilium should not run in an environment where the
Kubernetes Node resource is missing.

[1]: 5c365f2 ("ipam: Automatically
create CiliumNode resource on startup")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 2b44dcb ]

This is useful in warning or error level messages to help nudge the user
in the right direction when troubleshooting.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit ede69e8 ]

With this commit, the identity GC rate limit
(--identity-gc-rate-interval) becomes the effective rate at which
identities are garbage collected. Previously, the identity GC interval
(--identity-gc-interval) would cause the Operator to GC for that much
time, then sleep for that much time, rinse and repeat, effectively
halving the rate.

To use concrete numbers for an example, let's say our interval is 5m and
our GC rate limit is 1000 per minute. It would mean that previously,
we would GC 5000 identities at a maximum for 10m (assuming that deletion
takes 0s). How was that calculated? Each minute, we GC 1000 identities.
After 5m, we have GC'd 5000 identities. But now we have to sleep for 5m
because that's our GC interval. Hence making our effective GC rate 500
per minute (instead of being 1000/m).

Now, we compute the time taken to perform the actual GC and subtract
that from the interval. So in our above example, we eliminate the dead
time of 5m and avoid slashing our effective GC rate in half. This change
allows the Operator to keep up with the demand more efficiently. The
Operator will warn if the GC duration took longer than the interval and
set the sleep duration to 0.
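
The arithmetic above can be checked with a short Python sketch. The function names are assumptions made for illustration, and the model keeps the commit message's simplification that GC runs at exactly the rate limit; it is not the Operator's actual implementation.

```python
def old_effective_rate(interval_min: float, rate_per_min: float) -> float:
    # Old loop: GC for one interval, then sleep for one full interval.
    collected = rate_per_min * interval_min
    return collected / (interval_min + interval_min)


def sleep_duration(interval_min: float, gc_duration_min: float) -> float:
    # New loop: subtract the time spent in GC from the interval.
    # If GC took longer than the interval, the sleep is clamped to 0.
    return max(0.0, interval_min - gc_duration_min)


def new_effective_rate(interval_min: float, rate_per_min: float,
                       gc_duration_min: float) -> float:
    collected = rate_per_min * gc_duration_min
    cycle = gc_duration_min + sleep_duration(interval_min, gc_duration_min)
    return collected / cycle


# Numbers from the example: 5m interval, 1000 identities per minute.
print(old_effective_rate(5, 1000))
print(new_effective_rate(5, 1000, gc_duration_min=5))
print(sleep_duration(5, gc_duration_min=7))
```

The first value is the halved rate from the old behaviour, the second is the rate once the dead time is removed, and the third shows the clamped sleep when GC overruns the interval.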

Suggested-by: Joe Stringer <joe@cilium.io>
Suggested-by: Dan Wendlandt <dan@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 3441acc ]

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 27fd5cc ]

Signed-off-by: Stijn Smits <stijn@stijn98s.nl>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit d204d78 ]

The new option is used to specify a device whose globally scoped IP
address should be used for BPF-based masquerading.

This is a workaround for environments that use ECMP for outgoing
traffic via multiple devices and have a dedicated device whose IP
address should be used for masquerading. The workaround is relevant
until #17158 has been resolved (thus, we hide the flag).

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 83d30de ]

Having these environment variables allows the cherry-pick script to be
used on other projects that are not Cilium.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit c40ed79 ]

Before this commit, Hubble was ignoring proxy redirection information
from the policy-verdict events it received from the datapath. For
example, a cilium monitor event such as:

    Policy verdict log: flow 0x0 local EP ID 1531, remote ID 35429, proto 17, egress, action redirect, match L3-L4, 10.240.0.62:37282 -> 10.240.0.63:53 udp

would be displayed in hubble observe as:

    Sep 15 17:23:11.960: cilium-test/client-6488dcf5d4-f9kfl:37282 -> kube-system/coredns-d4866bcb7-zh5jv:53 L3-L4 FORWARDED (UDP)

This commit adds a new verdict REDIRECTED to signal such an event. Such
events now appear as:

    default/pod-to-external-fqdn-allow-google-cnp-5ff4986c89-n87h2:58314 -> kube-system/coredns-755cd654d4-j4vzh:53 UNKNOWN 5 (UDP)

A subsequent patch to the Hubble command line will display value 5 as
"REDIRECTED".
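
The "UNKNOWN 5" output can be explained with a small Python sketch of a name lookup that falls back to the raw number. Only the value 5 for REDIRECTED comes from this commit message; the lookup tables, the value 1 for FORWARDED, and the function name are assumptions for illustration and are not the Hubble CLI's code.

```python
VERDICT_REDIRECTED = 5  # value stated in the commit message

# Hypothetical name table of a CLI that predates the new verdict.
# The numeric value for FORWARDED is an assumption.
OLD_NAMES = {1: "FORWARDED"}

# Hypothetical table after the follow-up CLI patch.
NEW_NAMES = {**OLD_NAMES, VERDICT_REDIRECTED: "REDIRECTED"}


def verdict_name(value: int, names: dict) -> str:
    # Fall back to the raw number when the verdict is not known yet.
    return names.get(value, f"UNKNOWN {value}")


print(verdict_name(VERDICT_REDIRECTED, OLD_NAMES))
print(verdict_name(VERDICT_REDIRECTED, NEW_NAMES))
```

A client built before the new enum value was added has no name for it, so it can only print the number.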

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 9e4d84b ]

The Kubernetes client's User-Agent was never set and it would always
fall back to the default value. This commit fixes this issue and now all
Cilium components will correctly present their User-Agent.

Fixes: b31ed33 ("Add k8s client qps and burst as cli flags for the operator")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@jibi jibi added kind/backports This PR provides functionality previously merged into master. backport/1.10 labels Sep 29, 2021
@jibi jibi requested review from a team as code owners September 29, 2021 10:21
@jibi jibi force-pushed the pr/v1.10-backport-2021-09-29 branch from df8d246 to 386b917 Compare September 29, 2021 10:27
geakstr and others added 4 commits September 29, 2021 12:34
[ upstream commit 09f3c81 ]

Signed-off-by: Dmitry Kharitonov <dmitry@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 9008255 ]

The public function ForceExpiredByNames is not executed from anywhere so
this function can be safely removed.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 8983227 ]

In the FQDN architecture there's a DNS Cache per endpoint, used to track
which domain names each endpoint makes DNS requests for, and a global
DNS Cache whose main function is to help track which api.FQDNSelector
present in the policy applies to locally running endpoints. The latter,
as opposed to the former, didn't have any cleanup mechanism for the map
that tracked which entries should be garbage collected, causing the
global DNS Cache to grow.
This commit prevents those entries from being tracked for garbage
collection in the global DNS Cache.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit b281dd7 ]

Kubernetes 1.21 automatically adds a new label to all namespaces when
the NamespaceDefaultLabelName feature gate is enabled.
(https://kubernetes.io/docs/concepts/overview/_print/#automatic-labelling)

This commit adds an additional entry for all well-known identities
adding that label.

Signed-off-by: Mauricio Vásquez <mauricio@accuknox.com>
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit 5d37a2f ]

The Makefile contains all component versions which are then used to
generate the helm charts. This commit fixes some of those versions that
got out-of-sync with the right versions.

Fixes: 206105f ("helm: use 'quay.io/cilium/certgen:v0.1.5'")
Fixes: 09f3c81 ("helm: upgrade envoy to v1.18.4 for hubble-ui")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
[ upstream commit c4773d8 ]

As image versions are supposed to be set in the Makefile, we should add
a step on the GH workflow to verify the correctness of those versions.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@jibi jibi force-pushed the pr/v1.10-backport-2021-09-29 branch from 386b917 to 0560a80 Compare September 29, 2021 10:34
@pchaigno pchaigno left a comment

My PRs look good 👍

[ upstream commit 6dbabed ]

In initExcludedIPs() we build a list of IPs that Cilium needs to exclude
to operate. One check to determine if an IP should be excluded is based
on the state of the net device: if the device is not up, then its IPs
are excluded.

Unfortunately, this check is not enough, as it's possible to have a
device reporting an unknown state (because its driver is missing the
operstate handling, e.g. a dummy device) while still being operational.

This commit changes the logic in initExcludedIPs() to not exclude IPs of
devices reporting an unknown state.
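
The changed check can be sketched in Python as follows. This is a simplified model under stated assumptions: the device dictionaries, the `operstate` strings, and the function name are hypothetical, and only the state check from this commit message is modelled, not the rest of initExcludedIPs().

```python
def excluded_ips(devices):
    """Return the IPs of devices that are neither up nor in an unknown state."""
    excluded = []
    for dev in devices:
        # Devices reporting "unknown" (e.g. a dummy device whose driver lacks
        # operstate handling) may still be operational, so keep their IPs.
        if dev["operstate"] in ("up", "unknown"):
            continue
        excluded.extend(dev["ips"])
    return excluded


# Hypothetical devices using documentation-range addresses.
devices = [
    {"name": "eth0", "operstate": "up", "ips": ["192.0.2.10"]},
    {"name": "eth1", "operstate": "down", "ips": ["192.0.2.20"]},
    {"name": "dummy0", "operstate": "unknown", "ips": ["192.0.2.30"]},
]

print(excluded_ips(devices))
```

Before the change described above, the dummy device's address would also have ended up in the excluded list.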

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@jibi jibi closed this Sep 29, 2021
@jibi jibi reopened this Sep 29, 2021
jibi commented Sep 29, 2021

test-backport-1.10

@joestringer joestringer left a comment

For docs & GH workflows changes:

@aanm aanm left a comment

LGTM for my commits. Thanks!

@aanm aanm merged commit 4204f66 into v1.10 Oct 2, 2021
@aanm aanm deleted the pr/v1.10-backport-2021-09-29 branch October 2, 2021 12:52