Skip to content

Conversation

Apokleos
Copy link
Contributor

To well support vfio devices in PCIe root ports or switch ports in case of qemu-rs, we instroduce port devices containing RootPort and SwitchPort mostly aligned with the implementations of kata-runtime.

@Apokleos Apokleos requested review from BbolroC, pmores, zvonkok and gkurz and removed request for BbolroC November 25, 2024 09:28
@katacontainersbot katacontainersbot added the size/huge Largest and most complex task (probably needs breaking into small pieces) label Nov 25, 2024
@Apokleos Apokleos requested a review from lifupan November 25, 2024 09:32
@Apokleos Apokleos requested a review from justxuewei November 25, 2024 12:10
Copy link
Member

@BbolroC BbolroC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks great to me! I've left a few minor questions.
Since I'm not an expert in this area, I would like to delegate the approval to @gkurz and @pmores. Thanks a lot!

@zvonkok
Copy link
Contributor

zvonkok commented Dec 4, 2024

There are limits per HV how many root/switch ports can be created. We need to make sure that we're not exceeding those limits otherwise we get cryptic memory errors.

For QEMU we have a limit of 16 root ports or 16 switch ports or 64KB IO memory each root or switch port taking up of default 4K IO memory.

@zvonkok
Copy link
Contributor

zvonkok commented Dec 4, 2024

@BbolroC One outstanding thing to clarify is can we attach VFIO-AP devices to a root-port? This way we can drop the support for legacy bridges altogether.

@Apokleos
Copy link
Contributor Author

Apokleos commented Dec 5, 2024

There are limits per HV how many root/switch ports can be created. We need to make sure that we're not exceeding those limits otherwise we get cryptic memory errors.

For QEMU we have a limit of 16 root ports or 16 switch ports or 64KB IO memory each root or switch port taking up of default 4K IO memory.

Yes, sir. without KEP-4113 support, we need introduce many related codes to do such work which seems hacking to me.
But I am thinking how to address this problem and will make it be more friendly for the future KEP-4113 or sandbox API.

I have pulled request for get bar max memory in another PR. And I am looking forward to your INPUT @zvonkok @BbolroC

@Apokleos Apokleos force-pushed the pcie-port-devices branch 3 times, most recently from 12ffcc8 to 31a665b Compare December 6, 2024 12:46
@Apokleos
Copy link
Contributor Author

Apokleos commented Dec 6, 2024

I have tested it with related PRs and get the SWITHCH PORT output as below:

configuration

hotplug_vfio_on_root_bus = true
pcie_switch_port = 4
~/kata-cdi/apokleos/kata-containers/src/runtime-rs# ctr run -t --device /dev/vfio/75 --device /dev/vfio/76 --device /dev/vfio/214 --device /dev/vfio/215 --rm --runtime io.containerd.kata.v2 "$image" debug0456789 /bin/bash
[root@localhost /]# lspci -tvnn
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller [8086:29c0]
           +-01.0  Red Hat, Inc. Virtio RNG [1af4:1005]
           +-02.0-[01]--
           +-03.0-[02]--
           +-04.0  Red Hat, Inc. Virtio SCSI [1af4:1004]
           +-05.0  Red Hat, Inc. Virtio socket [1af4:1053]
           +-06.0  Red Hat, Inc. Virtio file system [1af4:105a]
           +-07.0-[03-08]----00.0-[04-08]--+-00.0-[05]----00.0  NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8]
           |                               +-01.0-[06]----00.0  NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8]
           |                               +-02.0-[07]----00.0  NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8]
           |                               \-03.0-[08]----00.0  NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8]
           +-08.0  Red Hat, Inc. Virtio console [1af4:1003]
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller [8086:2918]
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930]

get the root ports

           +-07.0-[03]--
           +-08.0-[04]--

the shell command

configuration

...
hotplug_vfio_on_root_bus = true

# Before hot plugging a PCIe device, you need to add a pcie_root_port device.
# Use this parameter when using some large PCI bar devices, such as Nvidia GPU
# The value means the number of pcie_root_port
# This value is valid when hotplug_vfio_on_root_bus is true and machine_type is "q35"
# Default 0
pcie_root_port = 2
[root@localhost /]# lspci -tvnn
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller [8086:29c0]
           +-01.0  Red Hat, Inc. Virtio RNG [1af4:1005]
           +-02.0-[01]--
           +-03.0-[02]--
           +-04.0  Red Hat, Inc. Virtio SCSI [1af4:1004]
           +-05.0  Red Hat, Inc. Virtio socket [1af4:1053]
           +-06.0  Red Hat, Inc. Virtio file system [1af4:105a]
           +-07.0-[03]--
           +-08.0-[04]--
           +-09.0  Red Hat, Inc. Virtio console [1af4:1003]
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller [8086:2918]
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930]
[root@localhost /]# 

@Apokleos Apokleos requested a review from BbolroC December 6, 2024 13:01
@BbolroC
Copy link
Member

BbolroC commented Dec 11, 2024

@BbolroC One outstanding thing to clarify is can we attach VFIO-AP devices to a root-port? This way we can drop the support for legacy bridges altogether.

Please take a look at the following:

Based on my understanding so far, there is no PCI support in s390x QEMU (so no concept like a root/switch port) and a bridge info is not required for VFIO-AP hot-plugging. I will raise a PR for VFIO-AP hot-plugging after this PR and #10362 are merged. Thanks.

@zvonkok
Copy link
Contributor

zvonkok commented Dec 11, 2024

hotplug_vfio_on_root_bus = true this is a very old unsupported option. We removed that some time ago. We should use

hot_plug_vfio = "root-port"
cold_plug_vfio = "root-port"

@zvonkok
Copy link
Contributor

zvonkok commented Dec 11, 2024

@Apokleos For the switch-port use-case I need to post a QEMU patch. Can you check if the BARs are correctly mapped?

dmesg | grep BAR

@Apokleos
Copy link
Contributor Author

@BbolroC One outstanding thing to clarify is can we attach VFIO-AP devices to a root-port? This way we can drop the support for legacy bridges altogether.

Please take a look at the following:

Based on my understanding so far, there is no PCI support in s390x QEMU (so no concept like a root/switch port) and a bridge info is not required for VFIO-AP hot-plugging. I will raise a PR for VFIO-AP hot-plugging after this PR and #10362 are merged. Thanks.

Thx Choi @BbolroC, I'd like to know that there's some work in my PR I can do for you which can help your coming works ? If any ideas, please let me know.

@Apokleos
Copy link
Contributor Author

hotplug_vfio_on_root_bus = true this is a very old unsupported option. We removed that some time ago. We should use

hot_plug_vfio = "root-port"
cold_plug_vfio = "root-port"

Sure. the coming updated PR will remove this option.

@zvonkok
Copy link
Contributor

zvonkok commented Dec 12, 2024

@Apokleos I am wondering how we're dealing with different archs, in the go-runtime we have different implementation for different architectures like x86 - s390x how are we dealing it with in runtime-rs?
A good example is that in x86 we do not need the bridge but s390x needs it. So we shouldn't introduce the bridge in x86 but only have it enabled in s390x.


/// Represents an available node in the PCIe topology.
#[derive(Clone, Debug)]
pub enum AvailabledNode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be AvailableNode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx Pavel. I think it should not be here now, I need remove this data structure as there's no reference of it in this PR. And I will add it back in a related PR(vfio device hotplug).

Copy link
Contributor

@pmores pmores left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Apokleos for this huge piece of work in a well-formatted PR! :-) I'm approving but frankly I'd prefer if you could deal with some of the remaining suggestions by @gkurz and @BbolroC either by incorporating them and resolving the pertinent conversations, or by documenting your reasons why not to incorporate them.

I've resolved my conversations for clarity that you've dealt with.

@Apokleos Apokleos added the do-not-merge PR has problems or depends on another label Apr 27, 2025
@Apokleos
Copy link
Contributor Author

Thanks @Apokleos for this huge piece of work in a well-formatted PR! :-) I'm approving but frankly I'd prefer if you could deal with some of the remaining suggestions by @gkurz and @BbolroC either by incorporating them and resolving the pertinent conversations, or by documenting your reasons why not to incorporate them.

I've resolved my conversations for clarity that you've dealt with.

Thx Pavel @pmores.
Yes, absolutely I will do. As I have promissed @zvonkok I will clean up these left comments.

@Apokleos
Copy link
Contributor Author

hotplug_vfio_on_root_bus = true this is a very old unsupported option. We removed that some time ago. We should use

hot_plug_vfio = "root-port"
cold_plug_vfio = "root-port"

I'd like to create an issue to do this work

hotplug_vfio_on_root_bus = true this is a very old unsupported option. We removed that some time ago. We should use

hot_plug_vfio = "root-port"
cold_plug_vfio = "root-port"

I think we should address it in another PR, ahead of it, I open an issue to track it

@Apokleos Apokleos added do-not-merge PR has problems or depends on another and removed do-not-merge PR has problems or depends on another labels Apr 27, 2025
@Apokleos
Copy link
Contributor Author

Hi @zvonkok @pmores @BbolroC I have updated the PR, and try to address the left comments. Is there any stuff I need address before it's merged ? It is with do-not-merge before your ACK.

@BbolroC
Copy link
Member

BbolroC commented Apr 28, 2025

Thanks for the update. I am fine with merging this. 👍

@pmores
Copy link
Contributor

pmores commented Apr 28, 2025

Hey @Apokleos , I'd still be in favour of mentioning the root/switch port counts limitation in the config file template but I'm fine merging this with or without that change. Thanks!

@zvonkok
Copy link
Contributor

zvonkok commented Apr 28, 2025

@Apokleos What happens if a VFIO device is hot-plugged to a root-port and the workload fails to start and the container is restarted. Is the root-port freed up? Can it be reused once he container restarts?

@pmores
Copy link
Contributor

pmores commented Apr 29, 2025

@zvonkok I don't think this PR deals with hotplugging at all.

@Apokleos
Copy link
Contributor Author

Hey @Apokleos , I'd still be in favour of mentioning the root/switch port counts limitation in the config file template but I'm fine merging this with or without that change. Thanks!

Thx Pavel @pmores
I hope my explanation makes sense. When I initially designed this, the primary goal was to align with the runtime-go's implementation to ensure functional consistency.

Secondly, it's aimed for a simpler pcie topology that could still support advanced features for devices like GPUs. It's what the limitation only one port device type(RootPort or SwitchPort) exists in a pod/VM.

I also spent some time considering if there were scenarios requiring both a dedicated root port and a combined root/switch port or called switch port, but couldn't identify any clear use cases. That's what led to the current design.

However, I'm certainly open to making adjustments in future implementations if such scenarios do arise.

@Apokleos
Copy link
Contributor Author

@Apokleos What happens if a VFIO device is hot-plugged to a root-port and the workload fails to start and the container is restarted. Is the root-port freed up? Can it be reused once he container restarts?

Thx @zvonkok This PR will not involve the vfio device hotplugging stuff, and the following one will do. But IMO, it's a point worth discussing. As I understand, continuously hot-plugging devices from VM isn't very reliable at the moment, so we might need to have a deep discussion about it based on the PR #10362.

@Apokleos Apokleos removed the do-not-merge PR has problems or depends on another label May 15, 2025
Apokleos added 9 commits May 15, 2025 20:10
(1) Introduce new field `pcie_switch_port` for switch ports.
(2) Add related checking logics in vmms(dragonball, qemu)

Fixes kata-containers#10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Support setting switch ports with annotatation or configuration.toml

Fixes kata-containers#10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This commit introduces an implementation for managing PCIe topologies,
focusing on the relationship between Root Ports and Switch Ports. The
design supports two strategies for generating Switch Ports:

Let's take the requirement of 4 switch ports as an example. There'll be
three possible solutions as below:
(1) Single Root Port + Single PCIe Switch: Uses 1 Root Port and 1 Switch
with 4 Downstream Ports.
(2) Multiple Root Ports + Multiple PCIe Switches: Uses 2 Root Ports and
2 Switches, each with 2 Downstream Ports.

The recommended strategy is Option 1 due to its simplicity, efficiency,
and scalability. The implementation includes data structures
(PcieTopology, RootPort, PcieSwitch, SwitchPort) and operations
(add_pcie_root_port, add_switch_to_root_port, add_switch_port_to_switch)
to manage the topology effectively.

Fxies kata-containers#10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
PortDevice is for handling root ports or switch ports in PCIe
Topology. It will make it easy pass the root ports/switch ports
information during create VM with requirements of PCIe devices.

Fixes kata-containers#10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
A new resource type `PortDevice` is introduced which is dedicated
for handling root ports/switch ports during sandbox creation(VM).

Fixes kata-containers#10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Prepare pcie port devices before starting VM with the help of
device manager and PCIe Topology.

Fixes kata-containers#10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Some data structures and methods are introduced to help handle vfio devices.
And mothods add_pcie_root_ports and add_pcie_switch_ports follow runtime's
related implementations of vfio devices.

Fixes kata-containers#10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Extract PortDevice relevant information, and then invoke different
processing methods based on the device type.

Fixes kata-containers#10361

Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
@Apokleos Apokleos force-pushed the pcie-port-devices branch from bf0dbc3 to 0753352 Compare May 15, 2025 12:11
@Apokleos Apokleos merged commit 305a5f5 into kata-containers:main May 18, 2025
289 of 350 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test size/huge Largest and most complex task (probably needs breaking into small pieces)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants