Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
There appears to be a race condition in which endpoints whose regeneration was retried after a temporary error end up associated with an empty policy map, eventually causing all traffic to and from them to be dropped.
How can we reproduce the issue?
Reproduced on latest main as of today (2ab44c2), with a patch [1] applied on top to simulate an error during BPF collection loading.
Created a simple development cluster with `make kind && make kind-image && make kind-install-cilium`, and deployed a test application with `k create deploy podinfo --image=stefanprodan/podinfo --replicas=5`.
Eventually, all endpoints turned ready as expected, but a few appear to be associated with an empty policy map [2]. All traffic from/to the affected pods is subsequently dropped:
```console
$ kgpwide podinfo-849bfb5c8d-l99dx
NAME                       READY   STATUS    RESTARTS   AGE   IP             NODE                 NOMINATED NODE   READINESS GATES
podinfo-849bfb5c8d-l99dx   1/1     Running   0          25m   10.244.0.247   kind-control-plane   <none>           <none>

$ cilium endpoint list | grep 3066
3066   Disabled   Disabled   3718   k8s:app=podinfo   10:244::f9c7   10.244.0.247   ready

$ cilium bpf policy get 3066
POLICY   DIRECTION   LABELS (source:key[=value])   PORT/PROTO   PROXY PORT   AUTH TYPE   BYTES   PACKETS   PREFIX
Policy stats empty. Perhaps the policy enforcement is disabled?

$ hubble observe -f --namespace=default
Apr 17 08:31:52.982: default/podinfo-849bfb5c8d-l99dx:41072 (ID:3718) <> 10.96.0.10:53 (world-ipv4) from-endpoint FORWARDED (UDP)
Apr 17 08:31:52.982: default/podinfo-849bfb5c8d-l99dx:41072 (ID:3718) <> kube-system/coredns-6f6b679f8f-mxnj2:53 (ID:60937) Policy denied DROPPED (UDP)
```
[1]:
```diff
diff --git a/pkg/bpf/collection.go b/pkg/bpf/collection.go
index 5c56c437a7ac..ca7ff284423b 100644
--- a/pkg/bpf/collection.go
+++ b/pkg/bpf/collection.go
@@ -8,6 +8,7 @@ import (
 	"errors"
 	"fmt"
 	"os"
+	"sync/atomic"

 	"github.com/cilium/ebpf"
 	"github.com/cilium/ebpf/asm"
@@ -264,6 +265,8 @@ type CollectionOptions struct {
 	MapRenames map[string]string
 }

+var count atomic.Int32
+
 // LoadCollection loads the given spec into the kernel with the specified opts.
 // Returns a function that must be called after the Collection's entrypoints are
 // attached to their respective kernel hooks. This function commits pending map
@@ -315,6 +318,9 @@ func LoadCollection(spec *ebpf.CollectionSpec, opts *CollectionOptions) (*ebpf.C
 	// Attempt to load the Collection.
 	coll, err := ebpf.NewCollectionWithOptions(spec, opts.CollectionOptions)

+	if count.Add(1) < 5 {
+		return nil, nil, fmt.Errorf("test error: %w", os.ErrExist)
+	}
 	// Collect key names of maps that are not compatible with their pinned
 	// counterparts and remove their pinning flags.
```
[2]: To be precise, the issue did not occur upon first deployment, but after restarting the Cilium agents. I don't think the restart is a precondition (I've observed the same issue without restarting the agents); rather, it suggests the bug stems from a race condition that does not always manifest.
Sysdump
cilium-sysdump-20250417-103605.zip
Code of Conduct
- I agree to follow this project's Code of Conduct