dylandreimerink

This PR completes the last step of table-based device reconciliation. It relieves the Daemon of the task of triggering datapath reloads and endpoint regeneration; that job now falls to the datapath orchestrator. The orchestrator implements a reconciler which, whenever it detects changes in devices, first reinitializes the loader and then regenerates the endpoints. This reconciliation can also be triggered externally, currently by config changes via the API.

The orchestrator explicitly checks that all required information is available before doing the initial reconciliation on startup. Until that reconciliation is done, it blocks any calls to the loader which require the datapath to be initialized.

@dylandreimerink dylandreimerink added the release-note/misc This PR makes changes that have no direct user impact. label Jun 10, 2024
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Jun 10, 2024
@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch 3 times, most recently from 2d874a2 to ed47c03 Compare June 12, 2024 09:02
@dylandreimerink

/test

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from ed47c03 to b1632c5 Compare June 12, 2024 11:54
@dylandreimerink

/test

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from b1632c5 to e9f80d5 Compare June 12, 2024 12:22
@dylandreimerink

/test

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from e9f80d5 to fb6aa1b Compare June 12, 2024 15:23
@dylandreimerink

/test

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from fb6aa1b to f844f5d Compare June 13, 2024 15:43
@dylandreimerink

/test

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from f844f5d to 13d499f Compare June 14, 2024 12:07
@dylandreimerink

/test

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from 13d499f to 7f3da6e Compare June 14, 2024 14:04
@dylandreimerink

/ci-clustermesh

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch 8 times, most recently from 8df8b83 to 1e59915 Compare June 19, 2024 14:05
@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from 1e59915 to 51d876f Compare June 26, 2024 10:10
@dylandreimerink

/test

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from 51d876f to 41ae10f Compare June 26, 2024 11:11
@dylandreimerink

/ci-ginkgo

github-merge-queue bot pushed a commit that referenced this pull request Jul 29, 2024
This fixes a rare crash that can occur when a restored endpoint is doing
DNS requests while the first loader Reinitialize() is still not completed
(e.g. waiting for node information).

Crash:
  time="2024-07-26T09:54:49Z" level=debug msg="Updated FQDN with new IPs" IPs="[75.2.60.5]" matchName=isovalent.com. subsys=fqdn
  time="2024-07-26T09:54:49Z" level=debug msg="Waited for endpoints to regenerate due to a DNS response" duration="64.816µs" endpointID=1050 qname=isovalent.com. subsys=daemon
  ...
  time="2024-07-26T09:54:49Z" level=debug msg="writing header file with DNSRules" DNSRulesV2="map[]" ciliumEndpointName=default/ubuntu ..
  panic: runtime error: invalid memory address or nil pointer dereference
  [signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x2c65ce7]

  goroutine 368 [running]:
  github.com/cilium/cilium/pkg/datapath/types.(*LocalNodeConfiguration).DeviceNames(...)
          /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/types/node.go:165
  github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteEndpointConfig(0xc00269ab40, {0x445aaa0, 0xc00067d060?}, 0x0, {0x44df670, 0xc001b28808})
          /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:1045 +0x127
  github.com/cilium/cilium/pkg/datapath/loader.(*loader).WriteEndpointConfig(0xc001b28808?, {0x445aaa0?, 0xc00067d060?}, {0x44df670?, 0xc001b28808?})

The issue is due to WriteEndpointConfig being called via the endpoint DNS
history trigger when the LocalNodeConfiguration is not yet set. Fix the
issue by initializing the trigger from regenerateBPF, which is called
only after datapath reinitialization has completed and the loader is ready
to process the endpoint config writing.

The fix was tested by adding a 5 second sleep into Reinitialize(), both
before the compilation lock and before nodeConfig.Store. This reliably
reproduced the issue and the fix was effective. Adding these sleeps
did not uncover other problems.

A principled long-term fix for this and similar issues lands in #33023
which gates all requests towards the loader and makes sure all relevant
data is present.

Fixes: #34019

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch 3 times, most recently from 45f29a2 to fb8e517 Compare July 29, 2024 12:27
@dylandreimerink

/test

@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from fb8e517 to 103c60a Compare July 29, 2024 12:48
@dylandreimerink dylandreimerink requested a review from joamaki July 30, 2024 09:00

@joamaki joamaki left a comment

💯

@dylandreimerink

/test

nbusseneau pushed a commit that referenced this pull request Aug 3, 2024
[ upstream commit 258819d ]

This fixes a rare crash that can occur when a restored endpoint is doing
DNS requests while the first loader Reinitialize() is still not completed
(e.g. waiting for node information).

Crash:
  time="2024-07-26T09:54:49Z" level=debug msg="Updated FQDN with new IPs" IPs="[75.2.60.5]" matchName=isovalent.com. subsys=fqdn
  time="2024-07-26T09:54:49Z" level=debug msg="Waited for endpoints to regenerate due to a DNS response" duration="64.816µs" endpointID=1050 qname=isovalent.com. subsys=daemon
  ...
  time="2024-07-26T09:54:49Z" level=debug msg="writing header file with DNSRules" DNSRulesV2="map[]" ciliumEndpointName=default/ubuntu ..
  panic: runtime error: invalid memory address or nil pointer dereference
  [signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x2c65ce7]

  goroutine 368 [running]:
  github.com/cilium/cilium/pkg/datapath/types.(*LocalNodeConfiguration).DeviceNames(...)
          /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/types/node.go:165
  github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteEndpointConfig(0xc00269ab40, {0x445aaa0, 0xc00067d060?}, 0x0, {0x44df670, 0xc001b28808})
          /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:1045 +0x127
  github.com/cilium/cilium/pkg/datapath/loader.(*loader).WriteEndpointConfig(0xc001b28808?, {0x445aaa0?, 0xc00067d060?}, {0x44df670?, 0xc001b28808?})

The issue is due to WriteEndpointConfig being called via the endpoint DNS
history trigger when the LocalNodeConfiguration is not yet set. Fix the
issue by initializing the trigger from regenerateBPF, which is called
only after datapath reinitialization has completed and the loader is ready
to process the endpoint config writing.

The fix was tested by adding a 5 second sleep into Reinitialize(), both
before the compilation lock and before nodeConfig.Store. This reliably
reproduced the issue and the fix was effective. Adding these sleeps
did not uncover other problems.

A principled long-term fix for this and similar issues lands in #33023
which gates all requests towards the loader and makes sure all relevant
data is present.

Fixes: #34019

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>

@rgo3 rgo3 left a comment

LGTM

gandro pushed a commit that referenced this pull request Aug 6, 2024
[ upstream commit 258819d ]

dylandreimerink and others added 6 commits August 7, 2024 13:47
Remove the device reloader, since the daemon is the only component left
that uses it. The daemon used it to trigger regeneration of all
endpoints; we will move this logic to the orchestrator, so the device
reloader is no longer needed.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
The Chdir() done by Daemon.init was racing with the loader, causing
just-compiled object files to not be found. Change the Chdir() in
initEnv() to switch to StateDir directly and remove the Chdir() from
Daemon.init.

Co-authored-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
This commit makes the LocalNodeConfig struct DeepEqual checkable.
Since `[]*net.IPNet` is not DeepEqual checkable, we needed to replace
it with `[]*cidr.CIDR`, which is a wrapper that adds the correct methods.

We will use the DeepEqual check in later commits to detect if changes in
the tables had any effect on the LocalNodeConfig generated from it. This
will reduce the amount of needless reconciliation churn.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
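As a rough illustration of how an equality check can cut reconciliation churn, here is a minimal Go sketch. It is not the code from this commit: `nodeConfig`, its fields, and `reconciler` are hypothetical, and the hand-written `DeepEqual` method stands in for whatever method the real struct provides.

```go
package main

import (
	"fmt"
	"slices"
)

// nodeConfig is a hypothetical, much smaller stand-in for LocalNodeConfig.
type nodeConfig struct {
	devices []string
	cidrs   []string
	mtu     int
}

// DeepEqual compares two configurations field by field. A nil value is
// only equal to another nil value.
func (c *nodeConfig) DeepEqual(o *nodeConfig) bool {
	if c == nil || o == nil {
		return c == o
	}
	return c.mtu == o.mtu &&
		slices.Equal(c.devices, o.devices) &&
		slices.Equal(c.cidrs, o.cidrs)
}

// reconciler remembers the last applied configuration and counts how
// often it actually had to do work.
type reconciler struct {
	last *nodeConfig
	runs int
}

// update reports whether a reconciliation was performed. It is skipped
// when the newly derived configuration equals the previous one.
func (r *reconciler) update(next *nodeConfig) bool {
	if r.last.DeepEqual(next) {
		return false
	}
	r.last = next
	r.runs++
	return true
}

func main() {
	r := &reconciler{}
	fmt.Println(r.update(&nodeConfig{devices: []string{"eth0"}, mtu: 1500})) // first config
	fmt.Println(r.update(&nodeConfig{devices: []string{"eth0"}, mtu: 1500})) // same content
	fmt.Println(r.update(&nodeConfig{devices: []string{"eth0", "eth1"}, mtu: 1500}))
	fmt.Println("reconciliations:", r.runs)
}
```

The point of the design is that a table change which does not alter the derived configuration costs one comparison instead of a full datapath reload.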
This commit changes the node.{Get,Set}IPv4Loopback methods to read from
or write to the current local node in the local node store. This allows
cells that depend on GetIPv4Loopback to just look at the local node
changes instead of having to poll the global functions for change.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
This commit implements full reconciliation of the loader state based on
the table state. Before this commit the "device reloader" would send a
signal to the daemon and the daemon would trigger reconciliation of the
loader and endpoints.

We are now moving this logic into its appropriate layer. The
orchestrator will now be responsible for triggering the reconciliation
of the loader and endpoints based on changes in the device table. In
addition, the orchestrator can also be triggered externally to handle
cases where configuration has changed which we can't subscribe to yet
via tables.

The daemon used to do a lot of implicit ordering. Since the orchestrator
is a proper hive cell, we have to be more explicit about it. Some of the
information needed to create the `LocalNodeConfig` only becomes
available some time after the initial startup. The orchestrator
starts a job which waits for the information to become available,
then does the initial reconciliation of the datapath via the loader. Only
after that is done will the initial reconciliation of all known endpoints
be requested.

We now plumb all calls that depend on the loader initialization through
the orchestrator. These calls will block until the loader is done
initializing the datapath.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Previously, the loader was part of the Datapath interface. Since we
no longer use a datapath types interface to abstract the loader, it
seems logical not to provide the loader via the datapath interface.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
@dylandreimerink dylandreimerink force-pushed the feature/loader-reconciliation-6 branch from 103c60a to 4f869a6 Compare August 7, 2024 11:47
@dylandreimerink

/test

@tklauser tklauser added this pull request to the merge queue Aug 7, 2024
@tklauser tklauser removed the request for review from doniacld August 7, 2024 15:00
Merged via the queue into cilium:main with commit 198ae3a Aug 7, 2024
65 of 66 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 7, 2024