
cilium: Unconditionally upsert neighbor entries #37352


Merged
merged 2 commits into main from pr/managed-neigh on Feb 5, 2025

Conversation

borkmann
Member

(see commit desc)

@borkmann borkmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. labels Jan 30, 2025
@borkmann borkmann requested a review from mhofstetter January 30, 2025 12:36
@borkmann borkmann requested a review from a team as a code owner January 30, 2025 12:36
@borkmann borkmann requested a review from ysksuzuki January 30, 2025 12:36
@borkmann
Member Author

(👀 into runtime ci issue)

Some of the service runtime tests set up the service handling with
a nil s.backendDiscovery:

  m.svc = newService(&FakeMonitorAgent{}, lbmap, nil, nil, true)

Therefore, add a nil check in s.upsertBackendNeighbors and
s.deleteBackendNeighbors so that the tests do not panic.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Starting with Cilium 1.18, we unconditionally upsert backend neighbor
entries. This means that if backends are in a different L2 domain, we
always keep the gateway entry "fresh" as a managed neighbor, and if
backends are in the same L2 domain, the kernel always has an up-to-date
hardware address. We do this unconditionally because even for services
with externalIPs, the external IPs could be in the same L2 domain. We
already track in-cluster nodes as managed neighbors, so this small
change additionally tracks external endpoints, which is not expected to
add much to the total set of managed neighbors.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann
Member Author

borkmann commented Feb 3, 2025

/test

@borkmann borkmann merged commit 2d40e52 into main Feb 5, 2025
293 of 294 checks passed
@borkmann borkmann deleted the pr/managed-neigh branch February 5, 2025 13:12
dylandreimerink added a commit to dylandreimerink/cilium that referenced this pull request Jun 10, 2025
PR cilium#37352 made it so that service backends unconditionally get
neighbor entries. This introduced a performance regression due to
the number of netlink calls needed to resolve next hops.

Strictly speaking, we do not need to create the neighbor entries unless
we are running XDP, so let's disable L2 neighbor discovery by default.
Forwardable IPs will still be created, but no further processing
will be done.

We already have guards in place in the desired neighbor calculator
that raise an error to notify the user when they are running with
XDP enabled but without L2 neighbor discovery enabled.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
dylandreimerink added a commit that referenced this pull request Jun 11, 2025
PR #37352 made it so that service backends unconditionally get
neighbor entries. This introduced a performance regression due to
the number of netlink calls needed to resolve next hops. We have
inherited this performance regression, which still exists in this new
implementation.

Strictly speaking, we do not need to create the neighbor entries unless
we are running XDP, so let's disable L2 neighbor discovery by default.
Forwardable IPs will still be created, but no further processing
will be done.

We already have guards in place in the desired neighbor calculator
that raise an error to notify the user when they are running with
XDP enabled but without L2 neighbor discovery enabled.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
dylandreimerink added a commit that referenced this pull request Jun 11, 2025
PR #37352 made it so that service backends unconditionally get
neighbor entries. This introduced a performance regression due to
the number of netlink calls needed to resolve next hops. We have
inherited this performance regression, which still exists in this new
implementation.

Strictly speaking, we do not need to create the neighbor entries unless
we are running XDP, so let's disable L2 neighbor discovery by default.
Forwardable IPs will still be created, but no further processing
will be done.

We already have guards in place in the desired neighbor calculator
that raise an error to notify the user when they are running with
XDP enabled but without L2 neighbor discovery enabled.

This commit also explicitly enables L2 neighbor discovery during
testing in various places so we still get coverage for it.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
dylandreimerink added a commit that referenced this pull request Jun 12, 2025
This commit introduces the reworked neighbor subsystem logic. The
neighbor logic that currently exists in the Linux node handler was not
designed with resilience in mind and was missing a few key features.

The interface of the neighbor package towards the rest of the agent
is in the form of "Forwardable IPs". Any component that wishes to
guarantee that XDP can forward traffic to a given IP address can
register such a forwardable IP. Since the same IP could be "owned" by
multiple components (e.g. a service and a node), we need some logic to
manage that. The `ForwardableIPManager` is responsible for this, and
all components have to go through it to register or remove their
forwardable IPs.

The `DesiredNeighborCalculator` takes forwardable IPs, L2 devices, and
the routing table as inputs and calculates the "Desired Neighbors".
This works by checking the next hop for all permutations of
forwardable IPs and L2 devices, then deduplicating the results.
We can do a partial update when new forwardable IPs are added. We
periodically do a full update to deal with the removal of forwardable
IPs. When the set of L2 devices or the contents of the routing table
change, we also do a full update at once.

Desired neighbors are reconciled to the kernel via a generic reconciler.
The existing logic for creating neighbor table entries is reused.
Entries can be kernel-managed or agent-managed, depending on the probed
kernel support. In addition, we have a "Neighbor refresher" job that
watches the neighbor StateDB table for changes. When an agent-managed
entry goes stale, it triggers a refresh of the entry via the reconciler.
This creates an active feedback loop instead of recreating entries on a
timer. The refresher also monitors for deletions of neighbor entries:
if we still desire a neighbor entry, it will be recreated.

As a final touch, we use the table initializer mechanism to ensure that
pruning of existing neighbor entries on startup only happens once all
"sources" of forwardable IPs confirm that they have added their initial
list of forwardable IPs. This ensures that we do not prune from the
kernel neighbor table until we have an initial set of desired neighbors
to reconcile.

All of these features together eliminate a set of issues that occur
when devices, routes, and neighbors change outside of Cilium's control.
We also eliminate issues where some entries were not being refreshed.

PR #37352 made it so that service backends unconditionally get
neighbor entries. This introduced a performance regression due to
the number of netlink calls needed to resolve next hops. We have
inherited this performance regression, which still exists in this new
implementation.

Strictly speaking, we do not need to create the neighbor entries unless
we are running XDP, so let's disable L2 neighbor discovery by default.
When XDP is enabled, we automatically also enable neighbor discovery.
When XDP is disabled, the user can still enable neighbor discovery
by passing the `--enable-l2-neighbor-discovery` flag to the agent.
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>