
network problem caused by absent reverse entry in snat_v4_external map #29305

@ghost

Description

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

This is a follow-up to issue #27821.

This problem still exists in v1.14.4.

Recently I found that the problem can be triggered when many connections are opened to targets on different ports. Steps to reproduce the problem:

  1. Run a server outside the cluster that listens on several ports at the same time. Assume the host address is 99.99.99.99.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Listen on ports 12345-12354 concurrently.
	for i := 12345; i < 12355; i++ {
		go Listen(i)
	}
	c := make(chan struct{})
	<-c // block forever
}

// Listen accepts TCP connections on the given port, prints "r" for each
// accepted connection, and closes it immediately.
func Listen(port int) {
	s := fmt.Sprintf("0.0.0.0:%d", port)
	fmt.Printf("start port %d\n", port)
	ln, err := net.Listen("tcp", s)
	if err != nil {
		panic(err)
	}

	for {
		netC, err := ln.Accept()
		if err != nil {
			fmt.Println(err)
		} else {
			fmt.Print("r")
			netC.Close()
		}
	}
}
  2. Create a client and build it into an image. The client accesses all the listened ports for one minute, then sleeps so the pod stays up.
package main

import (
	"flag"
	"fmt"
	"math/rand"
	"net"
	"strconv"
	"sync/atomic"
	"time"
)

var counter int32 = 0 // number of in-flight connection attempts

func main() {
	var max int
	var target string
	var r bool
	flag.IntVar(&max, "max", 1000, "max parallelism")
	flag.StringVar(&target, "target", "99.99.99.99", "addr")
	flag.BoolVar(&r, "rand", true, "random dest")
	flag.Parse()
	fmt.Printf("start with %d to %s\n", max, target)
	t := time.Tick(time.Minute)

	// Hammer the target ports for one minute, then sleep so the pod
	// stays alive for inspection.
out:
	for {
		select {
		case <-t:
			break out
		default:
			c := atomic.LoadInt32(&counter)
			if int(c) > max {
				continue
			}
			x := target + ":12345"
			if r {
				// Pick one of the ten listening ports at random.
				i := rand.Intn(10) + 12345
				x = target + ":" + strconv.Itoa(i)
			}

			go accessEndpoint(x)
		}
	}
	fmt.Println("done")
	time.Sleep(time.Hour * 30)
}

func accessEndpoint(target string) {
	atomic.AddInt32(&counter, 1)
	d := net.Dialer{Timeout: time.Second * 10}
	conn, err := d.Dial("tcp", target)
	if err != nil {
		info := ""
		if conn != nil {
			info = conn.LocalAddr().String() // this info is useless: conn is nil when Dial fails
		}
		fmt.Printf("%s [%s]\n", err, info)
	} else {
		conn.Close()
	}
	atomic.AddInt32(&counter, -1)
}
  3. On our machine, the NAT map size is 131072, which can be set with bpf-nat-global-max (a sketch for reading the configured size back from the pinned map follows).
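As an aside, the configured size can be read back from the pinned map itself. Below is a small, hypothetical Go sketch using the github.com/cilium/ebpf library; it is not part of the original repro, and the pin path is an assumption that should be verified against your node's bpffs layout.

package main

import (
	"fmt"

	"github.com/cilium/ebpf"
)

func main() {
	// Assumed pin path for the IPv4 SNAT map; check `ls /sys/fs/bpf/tc/globals`
	// on the node, as the layout can differ between versions.
	m, err := ebpf.LoadPinnedMap("/sys/fs/bpf/tc/globals/cilium_snat_v4_external", nil)
	if err != nil {
		panic(err) // requires root on the node
	}
	defer m.Close()
	fmt.Println("max entries:", m.MaxEntries())
}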

  4. Before creating the pod, flush the NAT map of one node: exec into the agent on that node and run

cilium bpf nat flush
  5. Create the pod on that node:
apiVersion: v1
kind: Pod
metadata:
  name: client
  namespace: <ns>
spec:
  containers:
  - image: <image>
    name: <name>
  nodeSelector:
    kubernetes.io/hostname: <node>

The server will continuously print r.

  6. After the server stops printing, examine the NAT map and look for entries that are OUT/egress but have no corresponding IN/ingress reverse NAT entry. Note the source port, the target address, and the target port from such an OUT entry; this can be done with cilium bpf nat list. I wrote my own eBPF map manipulation tool for Cilium's eBPF maps, so I won't paste the exact commands here. The steps are roughly as follows (a small Go sketch of the same diff comes after this list):
 - dump all items | grep addr | grep OUT | awk out the NAT addr and port > out.txt
 - dump all items | grep addr | grep IN > in.txt
 - for i in `cat out.txt`; do res=`cat in.txt | grep addr | grep $i`; if [ "$res" == "" ]; then echo $i; fi; done
Several items are printed; choose one, then:
 - dump all items | grep the chosen one | grep OUT
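For reference, the same diff can be done with a small Go program. This is a hypothetical sketch that assumes out.txt holds one NAT'd addr:port per line (from the OUT dump) and in.txt holds the raw IN dump, as in the steps above.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readLines returns the non-empty, trimmed lines of a file.
func readLines(path string) []string {
	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	var lines []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		if l := strings.TrimSpace(s.Text()); l != "" {
			lines = append(lines, l)
		}
	}
	return lines
}

func main() {
	outs := readLines("out.txt") // one "addr:port" per line
	in := strings.Join(readLines("in.txt"), "\n")

	// Print every OUT tuple with no matching IN (reverse) entry.
	for _, o := range outs {
		if !strings.Contains(in, o) {
			fmt.Println(o)
		}
	}
}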
  7. Enter the pod and access the target using the local port found in step 6:

nc -vzp <source port> <target addr> <target port>

The command gets stuck. However, if we execute the same command on the node hosting the pod, or if we use a different local port, it succeeds.

  8. Check the Prometheus metric cilium_datapath_signals_handled_total; nothing comes back:

curl 127.0.0.1:9090/metrics | grep nat_fill_up

Recently I noticed that a comment added in #28857 says:

// Typically NAT entries should get removed along with their owning CT entry,
// as part of purgeCtEntry*(). But stale NAT entries can get left behind if the
// CT entry disappears for other reasons - for instance by LRU eviction, or
// when the datapath re-purposes the CT entry.

And in a linked discussion, someone demonstrated that an LRU map can evict items even when it is far from full, and that the eviction order is not strictly LRU (a sketch of one way to test this follows).
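As a rough way to exercise that claim, below is a minimal, hypothetical Go sketch using the github.com/cilium/ebpf library (not part of this report's original repro): it creates a BPF_MAP_TYPE_LRU_HASH map, inserts only half of its capacity from many goroutines, then counts how many entries survived. Whether evictions actually occur below capacity depends on kernel version and CPU count; the sketch only illustrates how one might test it, and it needs root/CAP_BPF to run.

package main

import (
	"fmt"
	"sync"

	"github.com/cilium/ebpf"
)

func main() {
	const maxEntries = 8192
	const inserts = maxEntries / 2 // deliberately far from full

	// BPF_MAP_TYPE_LRU_HASH, the map type Cilium uses for the NAT map.
	m, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.LRUHash,
		KeySize:    4,
		ValueSize:  4,
		MaxEntries: maxEntries,
	})
	if err != nil {
		panic(err) // requires root / CAP_BPF
	}
	defer m.Close()

	// Insert from many goroutines so updates land on different CPUs'
	// LRU free lists.
	var wg sync.WaitGroup
	for w := 0; w < 16; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := w; i < inserts; i += 16 {
				if err := m.Put(uint32(i), uint32(i)); err != nil {
					fmt.Println("put:", err)
				}
			}
		}(w)
	}
	wg.Wait()

	// If fewer than `inserts` keys remain, the LRU map evicted entries
	// even though it was only half full.
	var k, v uint32
	count := 0
	it := m.Iterate()
	for it.Next(&k, &v) {
		count++
	}
	fmt.Printf("inserted %d, present %d, capacity %d\n", inserts, count, maxEntries)
}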

I tried changing NAT_MAP_TYPE to BPF_MAP_TYPE_HASH and reran the test: although the nat_fill_up signal was sent tens of times, the problem was solved. (A plain hash map rejects new inserts once it is full instead of silently evicting entries, so reverse entries are never lost.)

In PurgeOrphanNATEntries() of v1.11.17, only TUPLE_F_IN entries are checked; if the TUPLE_F_IN entry no longer exists, the TUPLE_F_OUT entry is never cleared. PurgeOrphanNATEntries() in v1.14.4 fixes this. However, if no nat_fill_up signal is sent, PurgeOrphanNATEntries() is not invoked at all, so the problem can persist for a long period. A toy model of the v1.11.17 behavior follows.
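To make the v1.11.17 behavior concrete, here is a toy Go model of the orphan scan, with plain Go maps standing in for the BPF NAT and CT maps; this is an illustration, not Cilium's actual code. An OUT entry whose IN twin has already been evicted is never visited by an IN-only scan, so it lingers.

package main

import "fmt"

// tuple is a stand-in for a NAT map key: the NAT'd address:port plus the
// direction flag (TUPLE_F_IN vs TUPLE_F_OUT).
type tuple struct {
	addrPort string
	ingress  bool
}

// purgeV111 models the v1.11.17 scan: only ingress (IN) entries are
// examined, and their OUT twins are removed along with them. An OUT entry
// whose IN twin is already gone is never considered.
func purgeV111(nat map[tuple]bool, ct map[string]bool) {
	for k := range nat {
		if !k.ingress {
			continue // OUT entries are skipped entirely
		}
		if !ct[k.addrPort] { // owning CT entry is gone: orphan
			delete(nat, k)
			delete(nat, tuple{k.addrPort, false})
		}
	}
}

func main() {
	// An OUT entry left behind after its IN twin and the owning CT entry
	// were both evicted (e.g. by the LRU map).
	nat := map[tuple]bool{{addrPort: "10.0.0.1:40000", ingress: false}: true}
	ct := map[string]bool{}

	purgeV111(nat, ct)
	fmt.Println("remaining NAT entries:", len(nat)) // prints 1: the orphan survives
}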

Cilium Version

v1.14.4 (and v1.11.17)

Kernel Version

5.10

Kubernetes Version

1.24.3

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

area/datapath, feature/snat, kind/bug, kind/community-report, stale
