Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
This is a follow-up issue of #27821
This problem still exists in v1.14.4.
Recently I found that the problem can be triggered when lots of connections try to connect to some targets from different local ports. Steps to reproduce the problem are:
- Run a server outside the cluster that listens on several ports at the same time. Assume the host address is 99.99.99.99:
```go
package main

import (
	"fmt"
	"net"
)

func main() {
	for i := 12345; i < 12355; i++ {
		go Listen(i)
	}
	c := make(chan struct{})
	<-c
}

func Listen(port int) {
	s := fmt.Sprintf("0.0.0.0:%d", port)
	fmt.Printf("start port %d\n", port)
	conn, err := net.Listen("tcp", s)
	if err != nil {
		panic(err)
	}
	for {
		netC, err := conn.Accept()
		if err != nil {
			fmt.Println(err)
		} else {
			fmt.Print("r")
			netC.Close()
		}
	}
}
```
- Create a client and put it into an image. This client accesses all the listened ports for 1 minute and then sleeps:
```go
package main

import (
	"flag"
	"fmt"
	"math/rand"
	"net"
	"strconv"
	"sync/atomic"
	"time"
)

var counter int32 = 0

func main() {
	var max int
	var target string
	var r bool
	flag.IntVar(&max, "max", 1000, "max parallelism")
	flag.StringVar(&target, "target", "99.99.99.99", "addr")
	flag.BoolVar(&r, "rand", true, "random dest")
	flag.Parse()
	fmt.Printf("start with %d to %s\n", max, target)
	t := time.Tick(time.Minute)
out:
	for {
		select {
		case <-t:
			break out
		default:
			c := atomic.LoadInt32(&counter)
			if int(c) > max {
				continue
			}
			x := target + ":12345"
			if r {
				i := rand.Intn(10) + 12345
				x = target + ":" + strconv.Itoa(i)
			}
			go accessEndpoint(x)
		}
	}
	fmt.Println("done")
	time.Sleep(time.Hour * 30)
}

func accessEndpoint(target string) {
	atomic.AddInt32(&counter, 1)
	d := net.Dialer{Timeout: time.Second * 10}
	conn, err := d.Dial("tcp", target)
	if err != nil {
		info := ""
		if conn != nil {
			info = conn.LocalAddr().String() // this info is useless
		}
		fmt.Printf("%s [%s]\n", err, info)
	} else {
		conn.Close()
	}
	atomic.AddInt32(&counter, -1)
}
```
- On our machines, the NAT map size is 131072, which can be set with `bpf-nat-global-max`.
- Before creating the pod, flush the NAT map of one node: enter the agent and run `cilium bpf nat flush`.
- Create the pod on that node:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: client
  namespace: <ns>
spec:
  containers:
  - image: <image>
    name: <name>
  nodeSelector:
    kubernetes.io/hostname: <node>
```
The server will continuously print `r`.
- After the server stops printing, examine the NAT map and look for entries that are OUT/egress but have no corresponding IN/ingress reverse NAT entry; note the source port, target address, and target port of such an OUT entry. The map can be dumped with `cilium bpf nat list`. I created an eBPF map manipulation tool to handle Cilium's eBPF maps, so I don't paste the exact commands here; the steps are roughly as follows (a minimal Go sketch of this cross-check is shown after this step):
  - dump all items | grep addr | grep OUT | awk out the NAT addr and port > out.txt
  - dump all items | grep addr | grep IN > in.txt
  - for i in `cat out.txt`; do res=`cat in.txt | grep addr | grep $i`; if [ "$res" == "" ]; then echo $i; fi; done

  Several items are printed; choose one and then:
  - dump all items | grep <the chosen one> | grep OUT
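For reference, the same cross-check can be done without my tool. Below is a minimal Go sketch, assuming `out.txt` holds one `<nat addr>:<port>` per line and `in.txt` holds the raw IN lines from the dump (the file names and layout are simply what the steps above produce):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readLines returns the non-empty, trimmed lines of a file.
func readLines(path string) []string {
	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	var lines []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		if l := strings.TrimSpace(s.Text()); l != "" {
			lines = append(lines, l)
		}
	}
	return lines
}

func main() {
	outTuples := readLines("out.txt") // one "<nat addr>:<port>" per line
	inLines := readLines("in.txt")    // raw IN lines from the NAT dump
	for _, t := range outTuples {
		found := false
		for _, l := range inLines {
			if strings.Contains(l, t) {
				found = true
				break
			}
		}
		if !found {
			// OUT/egress entry with no corresponding IN/ingress entry.
			fmt.Println(t)
		}
	}
}
```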
- Enter the pod and access the target using the local port found in the previous step:
  `nc -vzp <source port> <target addr> <target port>`
  The command hangs. However, if we execute the same command on the node hosting the pod, or if we switch to a different local port, it succeeds.
- Check the Prometheus metric `cilium_datapath_signals_handled_total`; nothing is returned:
  `curl 127.0.0.1:9090/metrics | grep nat_fill_up`
Recently I noticed that a comment in #28857 says:
```
// Typically NAT entries should get removed along with their owning CT entry,
// as part of purgeCtEntry*(). But stale NAT entries can get left behind if the
// CT entry disappears for other reasons - for instance by LRU eviction, or
// when the datapath re-purposes the CT entry.
```
And in that linked discussion, someone proved that an LRU map can evict entries even when it is far from full, and that the eviction is not strictly LRU.
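If you want to observe this yourself, here is a minimal sketch of how I would try to reproduce it (my assumption, not part of Cilium's code), using the `github.com/cilium/ebpf` library and root privileges. It creates a `BPF_MAP_TYPE_LRU_HASH` map, fills it to only half of its capacity from several goroutines, and counts how many keys survive; depending on kernel version and CPU count, some keys can already be evicted even though the map is far from full:

```go
package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/rlimit"
)

func main() {
	// BPF map creation needs CAP_BPF/root and an unlocked memlock limit.
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatal(err)
	}

	const maxEntries = 131072
	m, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.LRUHash,
		KeySize:    4,
		ValueSize:  4,
		MaxEntries: maxEntries,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer m.Close()

	// Insert only half the capacity, from several goroutines so that
	// different CPUs pull elements from the map's free lists.
	const inserted = maxEntries / 2
	const workers = 8
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := uint32(w); i < inserted; i += workers {
				if err := m.Put(i, i); err != nil {
					log.Println("put:", err)
				}
			}
		}(w)
	}
	wg.Wait()

	// Count how many of the inserted keys are still present.
	var present int
	var v uint32
	for i := uint32(0); i < inserted; i++ {
		if err := m.Lookup(i, &v); err == nil {
			present++
		}
	}
	fmt.Printf("inserted %d of %d capacity, still present: %d (evicted: %d)\n",
		inserted, maxEntries, present, inserted-present)
}
```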
I tried changing `NAT_MAP_TYPE` to `BPF_MAP_TYPE_HASH` and reran the test; although the `snat_nat_fill` signal was sent tens of times, the problem is solved.
In `PurgeOrphanNATEntries()` of v1.11.17, only `TUPLE_F_IN` entries are checked; if the `TUPLE_F_IN` entry does not exist, the corresponding `TUPLE_F_OUT` entry will not be cleared at that time. `PurgeOrphanNATEntries()` in v1.14.4 fixes this. However, if no `snat_nat_fill` signal is sent, `PurgeOrphanNATEntries()` is not invoked, so the problem persists for a period of time.
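To make this easier to follow, below is a heavily simplified sketch of the v1.11.17 behaviour as I understand it. The types and map layout are hypothetical, not Cilium's actual code; it only illustrates why an egress entry whose ingress partner was already evicted is never cleaned up:

```go
package main

import "fmt"

// tupleFIn / tupleFOut stand in for TUPLE_F_IN / TUPLE_F_OUT.
const (
	tupleFOut = iota
	tupleFIn
)

// natKey is a hypothetical NAT map key: a connection tuple plus a direction flag.
type natKey struct {
	tuple string
	flags int
}

// reverse returns the paired key for the opposite direction.
func (k natKey) reverse() natKey {
	if k.flags == tupleFIn {
		return natKey{tuple: k.tuple, flags: tupleFOut}
	}
	return natKey{tuple: k.tuple, flags: tupleFIn}
}

// purgeOrphanNATEntries mimics the v1.11.17 logic: only IN entries are
// examined; when one has no CT entry, it and its OUT partner are removed.
// An OUT entry whose IN partner was already evicted (e.g. by LRU) is never
// visited and therefore never removed.
func purgeOrphanNATEntries(nat map[natKey]bool, ctHas func(natKey) bool) {
	for k := range nat {
		if k.flags != tupleFIn {
			continue
		}
		if !ctHas(k) {
			delete(nat, k)           // remove the orphan IN entry
			delete(nat, k.reverse()) // and its paired OUT entry
		}
	}
}

func main() {
	nat := map[natKey]bool{
		// A pair whose CT entry is gone: both sides get purged.
		{tuple: "10.0.0.1:40000->99.99.99.99:12345", flags: tupleFIn}:  true,
		{tuple: "10.0.0.1:40000->99.99.99.99:12345", flags: tupleFOut}: true,
		// An orphan OUT entry whose IN partner was evicted: never purged.
		{tuple: "10.0.0.1:40001->99.99.99.99:12346", flags: tupleFOut}: true,
	}
	noCT := func(natKey) bool { return false } // pretend every CT entry is gone
	purgeOrphanNATEntries(nat, noCT)
	fmt.Println("entries left:", len(nat)) // 1: the orphan OUT entry survives
}
```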
Cilium Version
v1.14.4 (and v1.11.17)
Kernel Version
5.10
Kubernetes Version
1.24.3
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct