Skip to content

pkg/node: fix repeated error reporting in neighbor-link-updater #35179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

wedaly
Copy link

@wedaly wedaly commented Oct 2, 2024

Previously, neighbor-link-updater set the health status error for each neighboring node to a concatenation of all errors from all nodes processed so far.

Fix it by storing only the error from refreshing the particular node.

Avoid duplicate errors in health status for node-neighbor-link-updater

@wedaly wedaly requested a review from a team as a code owner October 2, 2024 15:54
@wedaly wedaly requested a review from ldelossa October 2, 2024 15:54
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 2, 2024
@wedaly
Copy link
Author

wedaly commented Oct 2, 2024

/test

@wedaly
Copy link
Author

wedaly commented Oct 2, 2024

discussed in cilium community meeting this morning: as this addreses memory spikes in Cilium 1.16 this should be backported to 1.16. (I don't seem to have permission to add the backport label)

Copy link
Contributor

@ldelossa ldelossa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@wedaly
Copy link
Author

wedaly commented Oct 7, 2024

Test flakes:

  1. known issue: CI: Cilium E2E Upgrade - failed to update lock #31374

  2. https://github.com/cilium/cilium/actions/runs/11150663588/job/30992420399 failed with:

    docker: Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by 
    authenticating and upgrading: https://www.docker.com/increase-rate-limit.
    
  3. Hubble CLI integration tests timed out? https://github.com/cilium/cilium/actions/runs/11150664494/job/30992328712

  4. https://github.com/cilium/cilium/actions/runs/11150663833/job/30992326435

    time="2024-10-02T20:11:08.922168043Z" level=fatal msg="Unable to connect to kvstore" error="timed out while starting the heartbeat watcher. Ensure that etcd is running on [http://127.0.0.1:4002/]" module=etcd subsys=kvstore
    

Will rerun tests and open issues for the new flakes if they pass on the second run.

@wedaly
Copy link
Author

wedaly commented Oct 7, 2024

/test

1 similar comment
@wedaly
Copy link
Author

wedaly commented Oct 7, 2024

/test

@wedaly
Copy link
Author

wedaly commented Oct 7, 2024

/ci-ginkgo

@wedaly
Copy link
Author

wedaly commented Oct 7, 2024

/ci-l4lb

@wedaly
Copy link
Author

wedaly commented Oct 7, 2024

/ci-e2e-upgrade

@wedaly
Copy link
Author

wedaly commented Oct 7, 2024

/ci-ginkgo

1 similar comment
@wedaly
Copy link
Author

wedaly commented Oct 7, 2024

/ci-ginkgo

Previously, neighbor-link-updater set the health status error
for each neighboring node to a concatenation of all errors from
all nodes processed so far.

Fix it by storing only the error from refreshing the particular node.

Signed-off-by: Will Daly <widaly@microsoft.com>
@wedaly wedaly force-pushed the widaly/fix-duplicate-l2-neighbor-errors branch from 5f2f7ef to fa81ca7 Compare October 8, 2024 20:47
@wedaly
Copy link
Author

wedaly commented Oct 8, 2024

/test

@wedaly
Copy link
Author

wedaly commented Oct 9, 2024

ci-ginkgo:

[Fail] K8sDatapathBGPTests Tests LoadBalancer [AfterEach] Connectivity to endpoint via LB
    2024-10-08T22:10:54.467959919Z time="2024-10-08T22:10:54.46773518Z" level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" related-metric=hubble_lost_events_total subsys=hubble

#30660

ci-ipsec-e2e:

[.] Action [no-policies/pod-to-world/http-to-one.one.one.one.-0: cilium-test-1/client-6db7b75479-lbtpf (10.244.4.230) -> one.one.one.one.-http (one.one.one.one.:80)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 -H Host: one.one.one.one --retry 3 --retry-all-errors --retry-delay 3 [http://one.one.one.one.:80](http://one.one.one.one./)" failed: error with exec request (pod=cilium-test-1/client-6db7b75479-lbtpf, container=client): command terminated with exit code 28

ci-ingress

  Scenario Outline: An Ingress with no rules should send all requests to the default backend and # features/load_balancing.feature:19
    Then The backend deployment "echo-service" for the ingress resource is scaled to 10 # features/load_balancing.feature:17
      Error: waiting for service (echo-service) endpoints available: timed out waiting for the condition

#24355

ci-runtime

• Failure in Spec Setup (BeforeEach) [550.067 seconds]
RuntimeAgentFQDNPolicies
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:[461](https://github.com/cilium/cilium/actions/runs/11244449590/job/31262630103#step:16:462)
  toFQDNs populates toCIDRSet (data from proxy)
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    Policy addition after DNS lookup [BeforeEach]
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

    Endpoints are not ready after timeout
    Expected
        <bool>: false
    to be true

#34731

ci-integration

time="2024-10-08T22:20:52.501757099Z" level=fatal msg="Unable to connect to kvstore" error="timed out while starting the heartbeat watcher. Ensure that etcd is running on [http://127.0.0.1:4002/]" module=etcd subsys=kvstore

@wedaly
Copy link
Author

wedaly commented Oct 9, 2024

/test

@wedaly
Copy link
Author

wedaly commented Oct 9, 2024

ci-ginko failed again with a different error:

https://github.com/cilium/cilium/actions/runs/11258851229/job/31306534220

FAIL: Failed waiting for  ipcache entry on k8s1
Expected
    <*fmt.wrapError | 0xc000415e20>: 
    ipcache entry for  was not found in time: 4m0s timeout expired
    {
        msg: "ipcache entry for  was not found in time: 4m0s timeout expired",
        err: <*errors.errorString | 0xc0008f9730>{
            s: "4m0s timeout expired",
        },
    }
to be nil

@joestringer joestringer added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Oct 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 9, 2024
@joestringer joestringer merged commit 0a9d63f into cilium:main Oct 9, 2024
61 of 63 checks passed
@wedaly wedaly deleted the widaly/fix-duplicate-l2-neighbor-errors branch October 9, 2024 22:41
@tklauser tklauser mentioned this pull request Oct 22, 2024
10 tasks
@tklauser tklauser added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Oct 22, 2024
@github-actions github-actions bot removed the backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. label Oct 22, 2024
@github-actions github-actions bot added the backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. label Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants