Table based loader reconciliation #33023
Merged
tklauser
merged 6 commits into
cilium:main
from
dylandreimerink:feature/loader-reconciliation-6
Aug 7, 2024
2d874a2
to
ed47c03
Compare
/test |
ed47c03
to
b1632c5
Compare
/test |
b1632c5
to
e9f80d5
Compare
/test |
e9f80d5
to
fb6aa1b
Compare
/test |
fb6aa1b
to
f844f5d
Compare
/test |
f844f5d
to
13d499f
Compare
/test |
13d499f
to
7f3da6e
Compare
/ci-clustermesh |
8df8b83
to
1e59915
Compare
1e59915
to
51d876f
Compare
/test |
51d876f
to
41ae10f
Compare
/ci-ginkgo |
joamaki
requested changes
Jul 29, 2024
github-merge-queue bot
pushed a commit
that referenced
this pull request
Jul 29, 2024
This fixes a rare crash that can occur when a restored endpoint is doing DNS requests while the first loader Reinitialize() is still not completed (e.g. waiting for node information).

Crash:

time="2024-07-26T09:54:49Z" level=debug msg="Updated FQDN with new IPs" IPs="[75.2.60.5]" matchName=isovalent.com. subsys=fqdn
time="2024-07-26T09:54:49Z" level=debug msg="Waited for endpoints to regenerate due to a DNS response" duration="64.816µs" endpointID=1050 qname=isovalent.com. subsys=daemon
...
time="2024-07-26T09:54:49Z" level=debug msg="writing header file with DNSRules" DNSRulesV2="map[]" ciliumEndpointName=default/ubuntu
..
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x2c65ce7]
goroutine 368 [running]:
github.com/cilium/cilium/pkg/datapath/types.(*LocalNodeConfiguration).DeviceNames(...)
    /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/types/node.go:165
github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteEndpointConfig(0xc00269ab40, {0x445aaa0, 0xc00067d060?}, 0x0, {0x44df670, 0xc001b28808})
    /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:1045 +0x127
github.com/cilium/cilium/pkg/datapath/loader.(*loader).WriteEndpointConfig(0xc001b28808?, {0x445aaa0?, 0xc00067d060?}, {0x44df670?, 0xc001b28808?})

The issue is due to WriteEndpointConfig being called via the endpoint DNS history trigger when the LocalNodeConfiguration is not yet set. Fix the issue by initializing the trigger from regenerateBPF, which is called only after datapath reinitialize has completed and it is ready to process the endpoint config writing.

The fix was tested by adding a 5 second sleep into Reinitialize(), both before the compilation lock and before nodeConfig.Store. This reliably reproduced the issue and the fix was effective. Adding these sleeps did not uncover other problems.

A principled long-term fix for this and similar issues lands in #33023, which gates all requests towards the loader and makes sure all relevant data is present.

Fixes: #34019

Signed-off-by: Jussi Maki <jussi@isovalent.com>
45f29a2
to
fb8e517
Compare
/test |
fb8e517
to
103c60a
Compare
joamaki
approved these changes
Jul 30, 2024
💯
/test |
nbusseneau
pushed a commit
that referenced
this pull request
Aug 3, 2024
[ upstream commit 258819d ]

This fixes a rare crash that can occur when a restored endpoint is doing DNS requests while the first loader Reinitialize() is still not completed (e.g. waiting for node information).

Crash:

time="2024-07-26T09:54:49Z" level=debug msg="Updated FQDN with new IPs" IPs="[75.2.60.5]" matchName=isovalent.com. subsys=fqdn
time="2024-07-26T09:54:49Z" level=debug msg="Waited for endpoints to regenerate due to a DNS response" duration="64.816µs" endpointID=1050 qname=isovalent.com. subsys=daemon
...
time="2024-07-26T09:54:49Z" level=debug msg="writing header file with DNSRules" DNSRulesV2="map[]" ciliumEndpointName=default/ubuntu
..
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x2c65ce7]
goroutine 368 [running]:
github.com/cilium/cilium/pkg/datapath/types.(*LocalNodeConfiguration).DeviceNames(...)
    /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/types/node.go:165
github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteEndpointConfig(0xc00269ab40, {0x445aaa0, 0xc00067d060?}, 0x0, {0x44df670, 0xc001b28808})
    /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:1045 +0x127
github.com/cilium/cilium/pkg/datapath/loader.(*loader).WriteEndpointConfig(0xc001b28808?, {0x445aaa0?, 0xc00067d060?}, {0x44df670?, 0xc001b28808?})

The issue is due to WriteEndpointConfig being called via the endpoint DNS history trigger when the LocalNodeConfiguration is not yet set. Fix the issue by initializing the trigger from regenerateBPF, which is called only after datapath reinitialize has completed and it is ready to process the endpoint config writing.

The fix was tested by adding a 5 second sleep into Reinitialize(), both before the compilation lock and before nodeConfig.Store. This reliably reproduced the issue and the fix was effective. Adding these sleeps did not uncover other problems.

A principled long-term fix for this and similar issues lands in #33023, which gates all requests towards the loader and makes sure all relevant data is present.

Fixes: #34019

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
rgo3
approved these changes
Aug 4, 2024
LGTM
gandro
pushed a commit
that referenced
this pull request
Aug 6, 2024
[ upstream commit 258819d ]

This fixes a rare crash that can occur when a restored endpoint is doing DNS requests while the first loader Reinitialize() is still not completed (e.g. waiting for node information).

Crash:

time="2024-07-26T09:54:49Z" level=debug msg="Updated FQDN with new IPs" IPs="[75.2.60.5]" matchName=isovalent.com. subsys=fqdn
time="2024-07-26T09:54:49Z" level=debug msg="Waited for endpoints to regenerate due to a DNS response" duration="64.816µs" endpointID=1050 qname=isovalent.com. subsys=daemon
...
time="2024-07-26T09:54:49Z" level=debug msg="writing header file with DNSRules" DNSRulesV2="map[]" ciliumEndpointName=default/ubuntu
..
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x2c65ce7]
goroutine 368 [running]:
github.com/cilium/cilium/pkg/datapath/types.(*LocalNodeConfiguration).DeviceNames(...)
    /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/types/node.go:165
github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteEndpointConfig(0xc00269ab40, {0x445aaa0, 0xc00067d060?}, 0x0, {0x44df670, 0xc001b28808})
    /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:1045 +0x127
github.com/cilium/cilium/pkg/datapath/loader.(*loader).WriteEndpointConfig(0xc001b28808?, {0x445aaa0?, 0xc00067d060?}, {0x44df670?, 0xc001b28808?})

The issue is due to WriteEndpointConfig being called via the endpoint DNS history trigger when the LocalNodeConfiguration is not yet set. Fix the issue by initializing the trigger from regenerateBPF, which is called only after datapath reinitialize has completed and it is ready to process the endpoint config writing.

The fix was tested by adding a 5 second sleep into Reinitialize(), both before the compilation lock and before nodeConfig.Store. This reliably reproduced the issue and the fix was effective. Adding these sleeps did not uncover other problems.

A principled long-term fix for this and similar issues lands in #33023, which gates all requests towards the loader and makes sure all relevant data is present.

Fixes: #34019

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
tommyp1ckles
approved these changes
Aug 6, 2024
Remove the device reloader. The daemon is the only component left that uses it: the daemon used it to trigger regeneration of all endpoints, and we will move this logic to the orchestrator, so the device reloader is no longer needed. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
The Chdir() done by Daemon.init was racing with the loader causing just compiled object files to not be found. Change the Chdir() in initEnv() to switch to StateDir directly and remove the Chdir() from Daemon.init. Co-authored-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
This commit makes the LocalNodeConfig struct DeepEqual checkable. Since `[]*net.IPNet` is not DeepEqual checkable, we needed to replace it with `[]*cidr.CIDR`, which is a wrapper that adds the correct methods. We will use the DeepEqual check in later commits to detect whether changes in the tables had any effect on the LocalNodeConfig generated from them. This will reduce the amount of needless reconciliation churn. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
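As a rough illustration of the idea in the commit message above, the following Go sketch shows how a wrapper type with a DeepEqual method lets a config struct be compared, so that reconciliation can be skipped when nothing changed. All names here (`CIDR`, `NodeConfig`, `needsReconcile`, `mustCIDR`) are hypothetical stand-ins written for this example; they are not Cilium's `cidr.CIDR` or `LocalNodeConfig`, and the real DeepEqual methods may be implemented differently.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// CIDR is a hypothetical wrapper around *net.IPNet that adds DeepEqual.
type CIDR struct {
	*net.IPNet
}

// DeepEqual reports whether both wrappers describe the same prefix.
func (c *CIDR) DeepEqual(o *CIDR) bool {
	if c == nil || o == nil {
		return c == o
	}
	if c.IPNet == nil || o.IPNet == nil {
		return c.IPNet == o.IPNet
	}
	return c.IP.Equal(o.IP) && bytes.Equal(c.Mask, o.Mask)
}

// NodeConfig is a hypothetical, much reduced stand-in for a node config.
type NodeConfig struct {
	DeviceNames []string
	Prefixes    []*CIDR
}

// DeepEqual compares two configs field by field.
func (n *NodeConfig) DeepEqual(o *NodeConfig) bool {
	if n == nil || o == nil {
		return n == o
	}
	if len(n.DeviceNames) != len(o.DeviceNames) || len(n.Prefixes) != len(o.Prefixes) {
		return false
	}
	for i := range n.DeviceNames {
		if n.DeviceNames[i] != o.DeviceNames[i] {
			return false
		}
	}
	for i := range n.Prefixes {
		if !n.Prefixes[i].DeepEqual(o.Prefixes[i]) {
			return false
		}
	}
	return true
}

// needsReconcile is true only when the newly derived config differs.
func needsReconcile(prev, next *NodeConfig) bool {
	return !prev.DeepEqual(next)
}

// mustCIDR parses a prefix and panics on malformed input (example only).
func mustCIDR(s string) *CIDR {
	_, ipnet, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return &CIDR{IPNet: ipnet}
}

func main() {
	prev := &NodeConfig{DeviceNames: []string{"eth0"}, Prefixes: []*CIDR{mustCIDR("10.0.0.0/24")}}
	same := &NodeConfig{DeviceNames: []string{"eth0"}, Prefixes: []*CIDR{mustCIDR("10.0.0.0/24")}}
	changed := &NodeConfig{DeviceNames: []string{"eth0", "eth1"}, Prefixes: []*CIDR{mustCIDR("10.0.0.0/24")}}

	fmt.Println(needsReconcile(prev, same))    // false: identical content, skip
	fmt.Println(needsReconcile(prev, changed)) // true: a device was added
}
```

The point of the comparison is that a table change which produces an identical config can be ignored rather than triggering a reload.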
This commit changes the node.{Get,Set}IPv4Loopback methods to read from or write to the current local node in the local node store. This allows cells that depend on GetIPv4Loopback to just look at the local node changes instead of having to poll the global functions for change. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
This commit implements full reconciliation of the loader state based on the table state. Before this commit the "device reloader" would send a signal to the daemon and the daemon would trigger reconciliation of the loader and endpoints. We are now moving this logic into its appropriate layer.

The orchestrator will now be responsible for triggering the reconciliation of the loader and endpoints based on changes in the device table. In addition, the orchestrator can also be triggered externally to handle cases where configuration has changed which we can't subscribe to yet via tables.

The daemon used to do a lot of implicit ordering. Since the orchestrator is a proper hive cell, we have to be more explicit about it. Some of the information needed to create the `LocalNodeConfig` only becomes available some time after the initial startup. The orchestrator starts a job which will wait for the information to become available, then do the initial reconciliation of the datapath via the loader. Only after that is done will the initial reconciliation of all known endpoints be requested.

We now plumb all calls that depend on the loader initialization through the orchestrator. These calls will block until the loader is done initializing the datapath.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Previously, the loader was part of the Datapath interface. Since we no longer use a datapath types interface to abstract the loader, it seems logical to not provide the loader via the datapath interface. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
103c60a
to
4f869a6
Compare
/test |
tklauser
approved these changes
Aug 7, 2024
Labels
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/misc
This PR makes changes that have no direct user impact.
sig/policy
Impacts whether traffic is allowed or denied based on user-defined policies.
This PR completes the last step in the process of table based device reconciliation. It relieves the Daemon of the task of triggering datapath reloads and endpoint regeneration. That job now falls to the datapath orchestrator. The orchestrator implements a reconciler which triggers the loader reinitialization and then, sequentially, endpoint regeneration whenever changes in devices are detected. This reconciliation can also be triggered externally, currently by config changes via the API.
The orchestrator explicitly checks for all required information to be available before doing the initial reconciliation on startup. It blocks any calls to the loader which require the datapath to be reinitialized until that is done.