Description
Problem
The current neighbor subsystem was designed for nodes and later extended to services. While it works in the majority of cases, there are situations where it can fail. It was built at a time when we did not have the reconciliation capabilities we have now.
Here is a list of known issues:
- We currently do not detect when neighbor entries go away, which can happen when an interface goes down or when entries are manually deleted.
- Our neighbor table entries are based on the next hop of an address; however, we do not update these entries when routes change.
- On pre-v5.16 kernels, the kernel does not automatically refresh stale externally managed neighbor entries. We implement a refresh mechanism, but it lives at the node manager level. This means only entries for nodes are refreshed, not those for services.
To fix all of the above, I propose to rework the architecture of the neighbor subsystem.
New neighbor subsystem
The overarching goal of this subsystem is to take a set of "forwardable IPs" and ensure the kernel's neighbor table always has fresh entries for them. This ensures that XDP programs will always be able to forward traffic for a given IP using the kernel's FIB.
These "forwardable IPs" will be stored in a StateDB table. Each entry has an IP (`netip.Addr`) and a list of "owners". An owner is any other component of Cilium that wants to ensure the IP is forwardable (for now, the node manager and services).
The neighbor entries are actually the next hops of the "forwardable IPs" for every "selected device" that is also an L2 device. We can get these devices from the `statedb.Table[*tables.Device]` table, and subscribe to changes. The most reliable way to get the next hop for a given IP is to query the FIB using `netlink.RouteGetWithOptions`. However, netlink requests are expensive, and we have no direct way of being notified when the result changes. Therefore we use the `statedb.Table[*tables.Route]` table as a proxy for the FIB, but only to get a signal on change; we still use `netlink.RouteGetWithOptions` to get the actual next hop. All of this results in a desired state, which we store in a new `statedb.Table[DesiredNeighbor]` table.
A reconciler will watch both the `statedb.Table[DesiredNeighbor]` and `statedb.Table[*tables.Neighbor]` tables to detect when the actual and desired states deviate. When they do, the reconciler updates the kernel neighbor table to match the desired state. Table initializers at every level ensure that the first reconciliation after agent startup happens only once all tables are populated, so we do not prune existing entries that may still be in use.
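The core of that reconciliation is a set diff between desired and actual state. A minimal sketch, with string keys instead of real table objects and slices returned instead of netlink calls issued:

```go
package main

import "fmt"

// neighborKey identifies a neighbor entry, simplified for the sketch.
type neighborKey struct {
	IP      string
	IfIndex int
}

// reconcile diffs desired against actual neighbor state: entries missing
// from the kernel are inserted, entries present in the kernel but no
// longer desired are pruned. The real reconciler would issue netlink
// calls instead of returning slices.
func reconcile(desired, actual map[neighborKey]struct{}) (insert, prune []neighborKey) {
	for k := range desired {
		if _, ok := actual[k]; !ok {
			insert = append(insert, k)
		}
	}
	for k := range actual {
		if _, ok := desired[k]; !ok {
			prune = append(prune, k)
		}
	}
	return insert, prune
}

func main() {
	desired := map[neighborKey]struct{}{
		{"192.168.1.1", 2}: {},
		{"192.168.2.1", 3}: {},
	}
	actual := map[neighborKey]struct{}{
		{"192.168.1.1", 2}: {}, // already correct
		{"10.9.9.9", 2}:    {}, // no longer desired
	}
	insert, prune := reconcile(desired, actual)
	fmt.Println(len(insert), len(prune)) // 1 1
}
```

This is also where the table initializers matter: running this diff before the actual-state table is fully populated would put in-use entries on the prune list.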
On kernels earlier than v5.16, the kernel does not automatically refresh stale externally managed neighbor entries. We have probes to detect this. On these older kernels, an additional task of the reconciler is to re-insert entries that have gone stale. There are three possible mechanisms for this, which the implementor will have to choose from:
- Use a separate job to periodically mark the entries so that the reconciler re-inserts them.
- Use the existing refresh logic in the generic reconciler (`config.RefreshInterval`).
- Use the reported state from `statedb.Table[*tables.Neighbor]` to see when the kernel marks an entry as stale, and then re-insert it.
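The third option can be sketched as a filter over the reported neighbor state. The constants mirror the kernel's `NUD_*` neighbor cache states; the surrounding types are illustrative:

```go
package main

import "fmt"

// Neighbor cache states, matching the kernel's NUD_* values
// from linux/neighbour.h (only the two relevant ones shown).
const (
	nudReachable = 0x02
	nudStale     = 0x04
)

type neighbor struct {
	IP    string
	State int
}

// staleEntries picks out entries the kernel has marked STALE, so the
// reconciler can re-insert them on pre-v5.16 kernels that do not refresh
// externally managed entries themselves.
func staleEntries(neighbors []neighbor) []neighbor {
	var out []neighbor
	for _, n := range neighbors {
		if n.State&nudStale != 0 {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	ns := []neighbor{
		{"192.168.1.1", nudReachable},
		{"192.168.2.1", nudStale},
	}
	fmt.Println(len(staleEntries(ns))) // 1
}
```

Compared with the first two options, this one is event-driven: it only touches entries the kernel has actually demoted, rather than refreshing everything on a timer.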