Description
This tracks a group of issues related to netkit that share a similar root cause and discussion around it.
- In "netkit mode with networkpolicy drops k3s kubelet Liveness/Readiness probe packets but veth mode did not" (#34042), a user reported that liveness/readiness probes stopped working after applying a `NetworkPolicy` to a `Pod` when using `netkit` and endpoint routes. The same configuration using `veth` did not have this problem. Some investigation turned up that packets coming from the host namespace get misidentified with the `world` identity instead of `host`. With `NetworkPolicies` applied to the endpoint, this leads to packets from liveness/readiness probes getting dropped.
- In "Using Cilium V1.16.0 External Access with DNS-Based Policies is not working properly" (#33875), a user reported that running `curl -I -s https://api.github.com/` from a `Pod` hangs when using `netkit` and endpoint routes after applying a `CiliumNetworkPolicy` which specifies `.spec.egress.toFQDNs`. The same configuration using `veth`, or with endpoint routes disabled, did not have this problem. Some investigation turned up similar symptoms, with reply DNS traffic returning from the proxy taking on the `world` identity in `cil_to_container`.
In both cases the root cause looks to be that netkit scrubs packets (clearing `skb->mark`) before executing attached BPF programs. This is evident looking at the source for `netkit_xmit`, where `netkit_prep_forward` clears `skb->mark` when crossing network namespaces before `netkit_run()` runs any attached programs:
```c
	netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
	eth_skb_pkt_type(skb, peer);
	skb->dev = peer;
	entry = rcu_dereference(nk->active);
	if (entry)
		ret = netkit_run(entry, skb, ret);
```
In contrast, when using `veth`, TC/TCX hooks are executed before the `veth` driver does any packet scrubbing.
**`netkit` egress processing order**

1. `sch_handle_egress` is called but does not execute any hooks, since BPF programs are attached directly to the device in `netkit` mode.
2. `netkit_xmit` begins.
3. `netkit_xmit` clears `skb->mark`.
4. `netkit_xmit` runs `cil_to_container`.
5. `netkit_xmit` passes the packet to the peer device.
**`veth` egress processing order**

1. `sch_handle_egress` is called, executing `cil_to_container`.
2. `veth_xmit` begins.
3. `veth_xmit` clears `skb->mark`.
4. `veth_xmit` passes the packet to the peer device.
Since `cil_to_container` uses `ctx->mark` for proxy redirection and policy enforcement, this can lead to various issues. For now this only seems to be a problem when using endpoint routes, but if `cilium_host` were to be changed to `netkit` (I'm not sure if this is the case already) it may interfere there as well.
Just to confirm that this is indeed the root cause: patching the netkit driver with this hack locally resolves the reported issues in both cases, making behavior consistent between `veth` and `netkit` modes (note: this is not a real fix, just something I used to test my observations).
```diff
 static void netkit_prep_forward(struct sk_buff *skb, bool xnet)
 {
+	u32 save_mark = skb->mark;
 	skb_scrub_packet(skb, xnet);
+	skb->mark = save_mark;
 	skb->priority = 0;
 	nf_skip_egress(skb, true);
 	skb_reset_mac_header(skb);
```
Tasks
- Short term, decide whether we should allow netkit to be used with endpoint routes enabled. We could add some checks on startup to block this combination. Con: this might make things stop working for some users who are already using this combination without issue.
- Document this as a known issue (docs: Add known issue for netkit endpoint route issues #35126)
- Patch netkit. Talking to @borkmann offline, he mentioned possibly adding a new mode to determine whether the scrub happens before or after running BPF. If implemented, this needs some follow-up work to make sure this mode is configured.
- Expand test coverage to include some of these scenarios with netkit + endpoint routes.