
CI: K8sPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath: provided port is already allocated #13071


Description

@christarazi

CI failure

/home/jenkins/workspace/Cilium-PR-K8s-1.12-net-next/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:514
cannot install connectivity-check
Expected command: kubectl apply --force=false -f /home/jenkins/workspace/Cilium-PR-K8s-1.12-net-next/src/github.com/cilium/cilium/examples/kubernetes/connectivity-check/connectivity-check-proxy.yaml 
To succeed, but it failed:
Exitcode: 1 
Err: exit status 1
Stdout:
 	 deployment.apps/echo-c created
	 deployment.apps/echo-c-host created
	 deployment.apps/pod-to-a-intra-node-proxy-egress-policy created
	 deployment.apps/pod-to-a-multi-node-proxy-egress-policy created
	 deployment.apps/pod-to-c-intra-node-proxy-ingress-policy created
	 deployment.apps/pod-to-c-multi-node-proxy-ingress-policy created
	 deployment.apps/pod-to-c-intra-node-proxy-to-proxy-policy created
	 deployment.apps/pod-to-c-multi-node-proxy-to-proxy-policy created
	 service/echo-c created
	 service/echo-c-headless created
	 service/echo-c-host-headless created
	 ciliumnetworkpolicy.cilium.io/pod-to-a-intra-node-proxy-egress-policy created
	 ciliumnetworkpolicy.cilium.io/pod-to-a-multi-node-proxy-egress-policy created
	 ciliumnetworkpolicy.cilium.io/pod-to-c-intra-node-proxy-to-proxy-policy created
	 ciliumnetworkpolicy.cilium.io/pod-to-c-multi-node-proxy-to-proxy-policy created
	 ciliumnetworkpolicy.cilium.io/echo-c created
	 deployment.apps/echo-a created
	 service/echo-a created
	 deployment.apps/echo-b created
	 deployment.apps/echo-b-host created
	 service/echo-b-host-headless created
	 
Stderr:
 	 The Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31313: provided port is already allocated
	 

/home/jenkins/workspace/Cilium-PR-K8s-1.12-net-next/src/github.com/cilium/cilium/test/k8sT/Policies.go:1495

After investigating this, it turns out that Services created in the test suite can conflict with each other on their nodePort. In some cases we don't explicitly set the nodePort and leave it to K8s to allocate a port from the 30000-32767 range; in other cases we do set the nodePort explicitly. In this test failure, a Service created earlier was randomly allocated port 31313, and a Service deployed later explicitly set its nodePort to 31313, causing the conflict.
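To make the two allocation modes concrete, here is a minimal sketch using client-go's core/v1 types. This is not code from the Cilium tree; echo-b and port 31313 come from the error output above, while the other Service name is purely illustrative.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// First kind of Service: NodePort left at 0, so the apiserver allocates a
	// random port from its service-node-port-range (30000-32767 by default).
	// In the failing run, some earlier Service happened to receive 31313 this way.
	randomlyAllocated := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "some-earlier-service"}, // illustrative name
		Spec: corev1.ServiceSpec{
			Type:  corev1.ServiceTypeNodePort,
			Ports: []corev1.ServicePort{{Port: 80}}, // NodePort == 0 => let K8s allocate
		},
	}

	// Second kind of Service: nodePort pinned to 31313, as echo-b does. If the
	// Service above was never cleaned up, creating this one is rejected by the
	// apiserver with "provided port is already allocated".
	pinned := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "echo-b"},
		Spec: corev1.ServiceSpec{
			Type:  corev1.ServiceTypeNodePort,
			Ports: []corev1.ServicePort{{Port: 80, NodePort: 31313}},
		},
	}

	fmt.Println(randomlyAllocated.Spec.Ports[0].NodePort, pinned.Spec.Ports[0].NodePort)
}

If the first Service leaks, any later Service that pins nodePort 31313 fails with exactly the "provided port is already allocated" error seen in the Stderr above.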

The conflict occurred in the first place because of a failure to clean up, for which we have no logs: the cleanup happened in a different block and its error was ignored, as cleanup errors usually are. We only retrieve output when an error occurs in a specific block (It(), etc.).

I posted on the #testing channel about different approaches and I will quote the post here, in case the Slack link expires.

I've come across a test that failed because two different Services conflicted on their nodePort. From what I can tell, a NodePort Service was allocated port 31313 earlier in the test suite, and then later a totally different NodePort Service was created with 31313 set as its explicit nodePort. This obviously created a conflict and spat out a message like:

The Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31313: provided port is already allocated

The first Service should have been deleted, but it clearly was not cleaned up, and there are no logs to explain why. We ignore errors when cleaning up to prevent the test suite from halting for "minor" reasons, but that does allow us to get into these situations, where we've potentially polluted the next test. Specifically for Services, I see two approaches to prevent this in the future (roughly sketched after the list below):

  1. Explicitly set the nodePort for every Service we create, so that we don't get bitten by the random allocation between 30000 - 32767.
  2. Where we set the nodePort explicitly, choose a port from outside the range (30000 - 32767), to avoid conflicts with random allocation within that range.
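A hypothetical sketch of what these could look like; only echo-b and 31313 come from this failure, all other names and numbers are invented for illustration.

package main

import "fmt"

// Approach 1: every nodePort the suite uses comes from one hard-coded table,
// so no Service relies on random allocation and two tests cannot silently
// claim the same port.
//
// Approach 2 would be the same table with ports chosen outside 30000-32767;
// note the apiserver rejects ports outside its --service-node-port-range, so
// that approach assumes the CI clusters configure a wider range.
var reservedNodePorts = map[string]int32{
	"echo-b":        31313,
	"test-nodeport": 31314,
	"test-lb":       31315,
}

func nodePortFor(service string) int32 {
	port, ok := reservedNodePorts[service]
	if !ok {
		panic(fmt.Sprintf("no reserved nodePort for %q; add it to the table", service))
	}
	return port
}

func main() {
	fmt.Println(nodePortFor("echo-b")) // 31313
}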

Actually, I just realized that this happened within the same parent Context block, which has nested Contexts within it. Essentially it looks like:

Context A {
  BeforeAll() { Create Service; randomly allocated nodePort 31313 }
  AfterAll()  { Clean up resources, but deletion of the Service above fails }
  ...
  Context B {
    BeforeAll() { Create Service with explicit nodePort 31313; conflict }
    AfterAll()  { Fails to clean up because the Service above was never created }
  }
}

Unfortunately, trying to dig into why the cleanup failed would be moot, because we don't retrieve output from passing blocks, so we still wouldn't have the reason it failed to clean up in the first place. We only have Context B's output (in fact, only the It() block within it). So that leaves us with the two approaches above.
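To illustrate the limitation being described, here is a minimal sketch of what retrieving cleanup output unconditionally could look like, written against stock ginkgo/v2 rather than the suite's own ginkgo-ext helpers (scopes.go); the kubectl invocation and manifest path are assumptions for the sketch, not the suite's actual cleanup code.

package connectivity_test

import (
	"fmt"
	"os/exec"
	"testing"

	. "github.com/onsi/ginkgo/v2"
)

func TestConnectivitySketch(t *testing.T) {
	RunSpecs(t, "connectivity cleanup sketch")
}

var _ = Describe("Multi-node policy test (sketch)", Ordered, func() {
	AfterAll(func() {
		// Delete the connectivity-check resources; path relative to the repo root.
		out, err := exec.Command("kubectl", "delete", "-f",
			"examples/kubernetes/connectivity-check/connectivity-check-proxy.yaml").CombinedOutput()
		// Write the cleanup output unconditionally instead of discarding it,
		// so a leaked Service (and its nodePort) shows up in the CI artifacts
		// even when every It() in this Context passed.
		GinkgoWriter.Write(out)
		if err != nil {
			// Still don't fail the suite over cleanup, but record why it failed.
			fmt.Fprintf(GinkgoWriter, "cleanup failed (ignored): %v\n", err)
		}
	})

	It("runs the connectivity checks", func() {
		// ... the actual datapath assertions would go here ...
	})
})

With something along these lines, the failed deletion of the first Service would at least be visible in the artifacts, even though every block in Context A passed.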

Found during #13053

Labels: area/CI, ci/flake, help-wanted, pinned
