set explicit liveness/readiness probe timeout for deny connectivity checks #10581
examples/kubernetes/connectivity-check.yaml includes a test that is expected to result in L3 denies if Cilium is operating correctly. It validates this with liveness/readiness probes that use bash to negate the exit code of curl: if curl exits with an "error", that is the correct result, so the bash command returns 0 and the readiness/liveness probe succeeds.
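The exit-code inversion can be sketched in shell, using `false` as a stand-in for a curl call that fails as expected when the connection is denied:

```shell
# 'false' stands in for a curl command that exits nonzero because the
# connection is (correctly) denied. The '!' inverts the exit status, so
# the probe command as a whole exits 0 and the probe passes.
if ! false; then
  echo "probe passes"
else
  echo "probe fails"
fi
```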
The curl command has an explicit 5-second timeout; however, the liveness and readiness probes have a default 1-second timeout. This means that if curl does not exit within 1 second, Kubernetes gives up on the readiness/liveness probe and declares it to have failed.
With this patch we explicitly set the readiness/liveness probe timeouts to 7 seconds, giving curl's own timeout timer (set to 5 seconds) time to trigger. The probe then runs long enough for the curl command to return a non-zero exit code, which, because of the bash negation, causes the probe to succeed.
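The resulting probe configuration would look roughly as follows; the target host and curl flags here are illustrative placeholders, not copied from the manifest:

```yaml
livenessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    # hypothetical denied target; curl's nonzero exit is inverted to success
    - '! curl --max-time 5 some-denied-service'
  # must exceed curl's 5s --max-time so curl can time out on its own
  # before Kubernetes gives up on the probe
  timeoutSeconds: 7
```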
Note: it is not clear to me why the lack of this explicit timeout does not cause issues in all k8s environments; the failures seem to happen only in specific environments, but in those environments they happen reliably. This may be due to differences in DNS or other configuration. For example, EKS with Bottlerocket OS (https://github.com/weaveworks/eksctl/blob/master/examples/20-bottlerocket.yaml) shows this behavior.
Signed-off-by: Dan Wendlandt dan@covalent.io