Skip to content

routing: Delete stale ingress and egress rules #39734

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

pippolo84
Copy link
Member

@pippolo84 pippolo84 commented May 26, 2025

    routing: Delete all stale egress and ingress rules
    
    The current rule deletion logic skips the removal in case more than one
    rule that matches the requested template is found. Therefore, if a stale
    rule is not garbage collected properly and another rule for the same
    address is installed, Cilium stops removing the installed rules,
    accumulating them indefinitely.
    
    The change addresses this issue deleting all the stale ingress and
    egress rules matching the specified template, even when more than one
    occurrence is found. In order to do so, a more generic
    deleteRulesFiltered is added, where all the installed rules matching a
    given template are then evaluated by a series of filter to decide if it
    should be removed or not.
    
    In case of egress rules, the new logic skips the deletion of those that
    refers to a table ID not managed by Cilium, that is, a table ID outside
    of the per ENI routing table range.
    
    The unit tests are changed accordingly, therefore expecting a success
    instead of an error when more than one rule is found.

Among the others, the result of the last TestDelete sub-test in pkg/datapath/linux/routing/routing_test.go, added in #24933, is altered by this PR.
The subtest was meant to verify that a change in the masquerading option generates an error when trying to remove the previously installed rules. Specifically, the workflow simulated in the test was:

  1. the agent is started with masquerading enabled
  2. as a consequence, egress routing rules are installed with a non-nil destination CIDR (see Configure)
  3. the agent is restarted with masquerade disabled
  4. the removal of the egress rules should now fail because the template is expecting a nil destination CIDR

But the test was actually failing due to the fact that in the test setup two rules where installed:

logger.go:256: time=2025-05-27T11:22:30.161+02:00 level=WARN source=/home/pippolo/cilium/cilium-tmp/pkg/datapath/linux/routing/routing.go:391 msg="Found too many rules matching, skipping deletion" candidates="[ip rule 111: from 192.168.2.123/32 to 192.168.0.0/16 table 11  ip rule 111: from 192.168.2.123/32 to 192.170.0.0/16 table 11 ]" rule="111: from 192.168.2.123/32 to all lookup 0 proto unspec"

In other words, the previous logic was not logging any error when a single stale egress rule with a non-nil destination CIDR was found. The error seen in the test was just a consequence of the limitation related to the multiple rules found.

With the changes in this PR, we remove all stale egress rules with a non-nil destination CIDR, since the template rule passed to deleteRulesFiltered has a nil dst CIDR, corresponding to a 0.0.0.0/0 CIDR. Therefore, all the previous rules are selected by route.ListRules and removed.
Note that the opposite (that is, restarting the agent after enabling masquerading) still leaves stale rules behind. This was not solved by the previous approach either, but requires #38285 instead.

Notes to reviewer: please review each commit individually

Remove stale ingress and egress routing rules after the deletion of an endpoint.

@pippolo84 pippolo84 added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/ipam IP address management, including cloud IPAM labels May 26, 2025
@pippolo84

This comment was marked as outdated.

pippolo84 added 3 commits May 27, 2025 11:01
Refactor computeTableIDFromIfaceNumber to take the compat flag as a
parameter. Depending on the compat flag value, the helper returns the
iface number to be used directly as the table ID or a the same iface
number plus a constant offset.

This refactor does not change anything but it will be useful in a
subsequent change.

Related: 67dc983 ("routing: Fix route collisions in AWS ENI")

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
In order to simplify the ingress and engress routing rules deletion
logic, removes the per family functions and rely on the underlying
deleteRule function directly, passing a family parameter.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Instead of relying on t.Log to separate each subtest output, use the
idiomatic table-driven test structure calling t.Run for each subtest.

Also, improve the error message reporting the error returned by Delete.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipam-delete-stale-rules branch 2 times, most recently from 4eff60e to ea7f1a1 Compare May 27, 2025 09:04
@pippolo84 pippolo84 added the area/eni Impacts ENI based IPAM. label May 27, 2025
@pippolo84

This comment was marked as outdated.

@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipam-delete-stale-rules branch 2 times, most recently from e2f982e to c87bb06 Compare May 27, 2025 09:30
@pippolo84

This comment was marked as outdated.

@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipam-delete-stale-rules branch from c87bb06 to 8782267 Compare May 27, 2025 09:43
@pippolo84

This comment was marked as outdated.

@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipam-delete-stale-rules branch from 8782267 to ac21f25 Compare May 27, 2025 09:52
@pippolo84

This comment was marked as outdated.

The current rule deletion logic skips the removal in case more than one
rule that matches the requested template is found. Therefore, if a stale
rule is not garbage collected properly and another rule for the same
address is installed, Cilium stops removing the installed rules,
accumulating them indefinitely.

The change addresses this issue deleting all the stale ingress and
egress rules matching the specified template, even when more than one
occurrence is found. In order to do so, a more generic
deleteRulesFiltered is added, where all the installed rules matching a
given template are then evaluated by a series of filter to decide if it
should be removed or not.

In case of egress rules, the new logic skips the deletion of those that
refers to a table ID not managed by Cilium, that is, a table ID outside
of the per ENI routing table range.

The unit tests are changed accordingly, therefore expecting a success
instead of an error when more than one rule is found.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
@pippolo84 pippolo84 force-pushed the pr/pippolo84/ipam-delete-stale-rules branch from ac21f25 to 181a22c Compare May 27, 2025 10:06
@pippolo84

This comment was marked as outdated.

@pippolo84 pippolo84 marked this pull request as ready for review May 27, 2025 13:43
@pippolo84 pippolo84 requested review from a team as code owners May 27, 2025 13:43
@pippolo84
Copy link
Member Author

cc @gandro @christarazi

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for @cilium/endpoint

@pippolo84
Copy link
Member Author

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 28, 2025
@julianwiedmann julianwiedmann added this pull request to the merge queue May 28, 2025
Merged via the queue into cilium:main with commit 2bd5751 May 28, 2025
75 checks passed
@pippolo84 pippolo84 added affects/v1.16 This issue affects v1.16 branch affects/v1.17 This issue affects v1.17 branch labels Jun 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.16 This issue affects v1.16 branch affects/v1.17 This issue affects v1.17 branch area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/eni Impacts ENI based IPAM. area/ipam IP address management, including cloud IPAM ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants