Skip to content

Conversation

glibsm
Copy link
Member

@glibsm glibsm commented Oct 4, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 16969 16974 17341 17489 17469 17154 17492; do contrib/backporting/set-labels.py $pr done 1.10; done

or with

$ make add-label branch=v1.10 issues=16969,16974,17341,17489,17469,17154,17492

bluikko and others added 9 commits October 4, 2021 18:38
[ upstream commit a8b3480 ]

It was not clear if kernel v5.8 has problem with libceph or if 5.8 fixes the problem.
Redo the sentence based on feedback to make it more clear and easy to read.

Signed-off-by: Ville Ojamo <bluikko@users.noreply.github.com>

Signed-off-by: Glib Smaga <code@gsmaga.com>
[ upstream commit cee08cd ]

This allows users who do not want cilium populating neighbor table with
mac addresses of neighbors cilium might have discovered via its
discovery process to opt out of cilium's neighbor discovery
mechanisms.

Signed-off-by: Ayodele Abejide <abejideayodele@gmail.com>

Signed-off-by: Glib Smaga <code@gsmaga.com>
[ upstream commit 5936a5a ]

This type will be useful for serializing cache entries for communication
between agent and external components

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>

Signed-off-by: Glib Smaga <code@gsmaga.com>
[ upstream commit fcc345e ]

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>

Signed-off-by: Glib Smaga <code@gsmaga.com>
[ upstream commit f6f2406 ]

coredns < 1.7.0 has a bug that makes the services resolution to become
out-of-sync with the last state from Kubernetes in case coredns
suffers from a disconnection with kube-apiserver [1]. This bug is fixed on
all versions equal and above 1.7.0. [2]

In our CI this affects all Kubernetes jobs 1.18 and below and can result
in flaky tests that have the result in the following similar logs:

```
service IP retrieved from DNS (10.101.253.144) does not match the IP for the service stored in Kubernetes (10.108.15.225)
```

[1] coredns/coredns#3587
[2] coredns/coredns#3924

Signed-off-by: André Martins <andre@cilium.io>

Signed-off-by: Glib Smaga <code@gsmaga.com>
[ upstream commit 0410c92 ]

xref: #16974

Signed-off-by: Ayodele Abejide <abejideayodele@gmail.com>

Signed-off-by: Glib Smaga <code@gsmaga.com>
[ upstream commit f7303af ]

This is for compatibility with Istio in kube-proxy free mode.
Currently, even though Istio would still get all traffic within pod
namespace, but the original service VIP is lost during socket lb,
causing it to miss all Istio routing chains and therefore bypassing
all Istio functionalities.

This adds a new option to bypass socket lb in pod namespace. When
enabled, service resolution for connection from pod namespaces will be
handled in bpf_lxc at veth. For host-namespaced pods, socket lb kept as
is.

Signed-off-by: Weilong Cui <cuiwl@google.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>

Signed-off-by: Glib Smaga <code@gsmaga.com>
[ upstream commit c88e06f ]

This will be used in the subsequent commit to check for a specific error
when allocating an IP.

Signed-off-by: Chris Tarazi <chris@isovalent.com>

Signed-off-by: Glib Smaga <code@gsmaga.com>
[ upstream commit baa9e9a ]

In CRD-backed IPAM modes such as ENI mode, some IPs were recently
removed from the IPAM pool [1]. In user environments where Cilium was
running a version prior to [1], it is possible for endpoints to be
assigned "unavailable" IPs. Upon upgrade (to a version which includes
[1]), endpoint restoration will fail with [2].

In order to workaround this failure and not disrupt the upgrade, this
change introduces a hidden flag
(`--bypass-ip-availability-upon-restore`) which will inform Cilium to
continue on if the restored endpoint's IP is not available for
reallocation, bypassing the specific error "IP is not available". Other
errors will not be bypassed, in order to reduce the scope of this
stop-gap solution.

With the flag set, restored endpoints which had "unavailable" IPs will
keep them. Any new endpoints / pods will be assigned fresh, valid IPs
from the pool.

This flag is only meant to be enabled with CRD-backed IPAM modes such as
ENI mode. The reason is because of the change described in [1], where
the primary ENI IP was removed from the IPAM pool. In any other mode
that this flag is enabled in, the user is warned that the flag is not
intended for other modes and will have no effect.

This patch is intended to be reverted in the future, as this stop-gap
solution will no longer be required, as users of Cilium don't upgrade
from versions prior to [1]. I propose that we revert this in the
following release that this patch makes it in (N+1).

How was this tested?

1) Deployed a Cilium version that doesn't include [1] on EKS cluster
2) Created a Deployment object which I scaled to max out the ENI IPs,
   such that at least one pod is assigned an "unavailable" IP
3) Upgraded Cilium to a version which does include [1] and observe [2]
   failures
4) Reset cluster back to state from (2)
5) Upgrade Cilium to the version that contains this commit
6) Observe log msgs from this commit and endpoint restoration succeeding
7) Scale Deployment to 0 and back up, to restart all pods
8) Observe that they all get fresh IPs and none of the "unavailable"
   IPs

[1]: #15453
[2]:

```
{
"time":"2021-09-20T16:57:00.400086481Z",
"level":"WARN",
"origin":"cilium.io/agent",
"message":"Unable to restore endpoint, ignoring",
"params":{
    "endpointID":"992",
    "error":"Failed to re-allocate IP of endpoint: unable to reallocate 10.0.133.193 IPv4 address: IP 10.0.133.193 is not available",
    "k8sPodName":"default/pod-1",
    "subsys":"daemon"
  }
}
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>

Signed-off-by: Glib Smaga <code@gsmaga.com>
@glibsm glibsm requested a review from a team as a code owner October 4, 2021 21:30
@glibsm glibsm requested review from brb, christarazi, nebril and aanm October 4, 2021 21:31
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.10 kind/backports This PR provides functionality previously merged into master. labels Oct 4, 2021
Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for my commit

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes LGTM, thanks.

@glibsm
Copy link
Member Author

glibsm commented Oct 6, 2021

/test-backport-1.10

Job 'Cilium-PR-K8s-1.19-kernel-5.4' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sDatapathConfig Encapsulation Check connectivity with sockops and VXLAN encapsulation

Failure Output

FAIL: Error creating resource /home/jenkins/workspace/Cilium-PR-K8s-1.19-kernel-5.4/src/github.com/cilium/cilium/test/k8sT/manifests/l3-policy-demo.yaml: Cannot retrieve cilium pod cilium-p6w8t policy revision: cannot get revision from json: could not parse JSON from command "kubectl exec -n kube-system cilium-p6w8t -- cilium policy get -o json"

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-5.4 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-Runtime-4.9' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

RuntimePrivilegedUnitTests Run Tests

Failure Output

FAIL: Failed to run privileged unit tests

If it is a flake, comment /mlh new-flake Cilium-PR-Runtime-4.9 so I can create a new GitHub issue to track it.

@christarazi
Copy link
Member

christarazi commented Oct 6, 2021

@glibsm It looks like test-runtime-4.9 hit a known issue but was only fixed in master, so it needs backport to v1.10. You could probably just mark the PR for needs-backport, but not actually include in this one.

And test-1.19-5.4 hit known flake: #17069

@glibsm
Copy link
Member Author

glibsm commented Oct 7, 2021

Marking ready based on previous comment #17531 (comment)

@glibsm glibsm added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 7, 2021
@errordeveloper errordeveloper merged commit b4c9323 into v1.10 Oct 7, 2021
@errordeveloper errordeveloper deleted the pr/v1.10-backport-2021-10-04 branch October 7, 2021 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants