Skip to content

Conversation

bchalios
Copy link
Contributor

@bchalios bchalios commented Apr 12, 2022

Avoid extra memory copy when writing to tap device in virtio-net TX path

Currently, when performing TX, a network frame might be exist in various scattered in various memory regions described by a DescriptorChain that we receive from the guest. In order to write the frame to the tap device we first perform a number of copies from said memory buffers to a single buffer which we then write to the Tap file descriptor.

This PR reverts that, to use scatter-gather IO using the writev system call which avoids the intermediate memory copies.

Fixes #420

Description of Changes

The main blocker for using the writev system call in our device is that we need to be able to inspect the outgoing frame to check if it is destined towards mmds. This requires us to read at least the frame headers which makes the code quite convoluted.

In order to do this in a clean way, I introduce an IoVec struct which is essentially a Vec of std::io::IoSlice. The struct can be created from a Descriptor chain (without performing any copies) and can be directly used to perform the writev system call.

Moreover, it provides methods for reading ranges of bytes from it (which at the moment perform a copy) and this functionality is used to inspect the ranges corresponding to a destination addresses in order to check if the frame is destined for mmds.

  • This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

  • All commits in this PR are signed (git commit -s).
  • The issue which led to this PR has a clear conclusion.
  • This PR follows the solution outlined in the related issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • Any API changes follow the Runbook for Firecracker API changes.
  • Any user-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.

@bchalios bchalios requested a review from serban300 April 12, 2022 12:48
@bchalios bchalios added Good first issue Indicates a good issue for first-time contributors Feature: IO Virtualization labels Apr 12, 2022
@bchalios bchalios marked this pull request as draft April 12, 2022 12:50
@bchalios bchalios changed the title [RFC] Feat tx iovec [RFC] Use writev for writing guest outgoing packets to the tap device Apr 12, 2022
@bchalios bchalios self-assigned this Apr 12, 2022
@bchalios
Copy link
Contributor Author

bchalios commented Apr 19, 2022

Leaving here for reference some initial performance measurements for the TX path. Baseline, is latest main.
The measurements are percentage improvement comparing to baseline, i.e. (new_version - baseline) / baseline * 100.0).

#VMs Bandwidth VMM thread CPU load kworker CPU load* Bandwidth per CPU load
1 15.97 -3.70 50.00 19.29
2 13.38 -5.52 0.00 19.35
3 14.14 -2.87 40.00 16.19
4 16.94 -3.40 12.50 20.12
5 13.94 -3.70 0.00 18.08
6 12.41 -4.08 9.09 16.94
7 11.29 -6.16 25.00 17.90
8 7.08 -8.54 23.08 16.49
9 7.31 -10.81 -6.67 20.18
10 1.85 -12.10 6.25 15.59
11 1.56 -13.50 -5.26 17.08
12 0.51 -13.86 23.53 15.76

*kworker load numbers include a lot of noise, since we are measuring the load of all kworker threads on the system.

@bchalios bchalios force-pushed the feat_tx_iovec branch 4 times, most recently from ecfbb9a to 8bd30c4 Compare April 22, 2022 08:03
@bchalios
Copy link
Contributor Author

bchalios commented May 5, 2022

I also tried modifying the RX path in order to use readv.

This required changing a bit the order in which we do things in the RX path. The order roughly goes like:

  1. We get an event from the tap fd
  2. We try to find an available DescriptorChain from the RX queue
  3. We use readv to read from the tap directly inside the the DescriptorChain
  4. Goto 2 until reading from tap returns EAGAIN (or an error occurs)

The implementation works (here is a prototype), however the performance degrades significantly.

I tried to track down the reason, and I took flamegraphs for executions with the test implementation and comparing it against the version with only vectored-writes enabled (this PR).

The flame graph when using only writev:
tx_iovec

and the one when using both writev and readv:
rx_iovec

In the second case the majority of the time for receiving one packet (Net::read_from_mmds_or_tap) is consumed by DescriptorChain::checked_new. This is not true in the former case.

This made me realize, that in the original code we do not parse the whole chain if it is not needed. For example if we are receiving an ICMP frame (54 bytes long) we will parse only one descriptor in the chain. With the new workflow we do not know the size of the packet in advance (tap does not provide us fseek or something similar) so we will parse the whole chain when getting the descriptor to write in. These descriptors chains are >= 64K bytes long with our current configuration and consist of many descriptors, which causes the increased time consumed in DescriptorChain::checked_new.

At the moment, I do not see any way to avoid this issue. The only way around this I could think of is trying to ask the tap how many bytes there are available for reading but this functionality doesn't seem to be there.

@bchalios bchalios force-pushed the feat_tx_iovec branch 2 times, most recently from 523d030 to f18b35d Compare May 30, 2022 09:33
Copy link
Contributor

@acatangiu acatangiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! This is not only more efficient, but I think it also simplifies the emulation code. 💯

leaving a few comments below

@bchalios bchalios force-pushed the feat_tx_iovec branch 3 times, most recently from 3fd7c59 to ce1c8a3 Compare June 2, 2022 07:46
@bchalios bchalios marked this pull request as ready for review June 2, 2022 07:46
@bchalios bchalios changed the title [RFC] Use writev for writing guest outgoing packets to the tap device Use writev for writing guest outgoing packets to the tap device Jun 2, 2022
@bchalios bchalios force-pushed the feat_tx_iovec branch 2 times, most recently from 606dc38 to 89828d7 Compare June 2, 2022 14:54
@bchalios
Copy link
Contributor Author

I used the tool of @zulinx86 to quantify the performance change introduced by this PR on all the platforms we support.

TCP throughput results:

kernel m5d (Sky lake) m5d (Cascade lake) m6i m6gd
4.14 3.1% 4.4% 7.6% 13.5%
5.10 3.7% 5.2% 7.3% 16.5%

network latency results:

kernel m5d (Sky lake) m5d (Cascade lake) m6i m6gd
4.14 -9.8% -1.8% -0.6% 0.0%
5.10 0.7% 1.0% 0.5% 2.9%

acatangiu
acatangiu previously approved these changes Dec 12, 2022
Copy link
Contributor

@acatangiu acatangiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buy this man a beer! 🍻

alxiord
alxiord previously approved these changes Dec 13, 2022
@bchalios bchalios force-pushed the feat_tx_iovec branch 3 times, most recently from 74d0067 to 3de46ac Compare December 28, 2022 10:10
The standard library handles vectored writes and reads using the
`IoSlice` and `IoSliceMut` respectively. We introduce a new type,
`IoVecBuffer` which essentially is a vector of `IoSlice` and is able to
be instantiated from a `DescriptorChain`.

`IoVecBuffer` provides us the necessary bits for using `write_vectored`
to transmit packets in the tap.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
This allows us to avoid an extra user-space copy of the packet we are
transmitting to the tap. We use directly the buffers described by the
`DescriptorChain` by means of `IoVecBuffer`s.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Move the tap mock object from the `Net` device to the `Tap` iteslf and
extend said object to allow us to mock as well tap write failures.

Also, add a few extra unit-tests in the mmds network stack.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Update baselines to reflect the improvements in the TX path due to
vectored writes.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Update baselines to reflect the (small) improvements in the TX path due
to vectored writes.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Signed-off-by: Babis Chalios <bchalios@amazon.es>
@bchalios bchalios merged commit d2c52b8 into firecracker-microvm:main Jan 2, 2023
@bchalios bchalios deleted the feat_tx_iovec branch January 2, 2023 11:43
roypat added a commit to roypat/firecracker that referenced this pull request Feb 17, 2023
Added missing entries for 1.3 release
- bchalios@'s net scatter-gather improvements (firecracker-microvm#2958)
- seccompiler change to make builds reproducible (firecracker-microvm#3445)

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
roypat added a commit to roypat/firecracker that referenced this pull request Feb 17, 2023
Added missing entries for 1.3 release
- bchalios@'s net scatter-gather improvements (firecracker-microvm#2958)
- seccompiler change to make builds reproducible (firecracker-microvm#3445)

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
roypat added a commit to roypat/firecracker that referenced this pull request Feb 17, 2023
Added missing entries for 1.3 release
- bchalios@'s net scatter-gather improvements (firecracker-microvm#2958)
- seccompiler change to make builds reproducible (firecracker-microvm#3445)

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
luminitavoicu pushed a commit that referenced this pull request Feb 17, 2023
Added missing entries for 1.3 release
- bchalios@'s net scatter-gather improvements (#2958)
- seccompiler change to make builds reproducible (#3445)

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
jkrshnmenon pushed a commit to jkrshnmenon/firecracker that referenced this pull request Mar 7, 2023
Added missing entries for 1.3 release
- bchalios@'s net scatter-gather improvements (firecracker-microvm#2958)
- seccompiler change to make builds reproducible (firecracker-microvm#3445)

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
JonathanWoollett-Light pushed a commit to JonathanWoollett-Light/firecracker that referenced this pull request Mar 17, 2023
Added missing entries for 1.3 release
- bchalios@'s net scatter-gather improvements (firecracker-microvm#2958)
- seccompiler change to make builds reproducible (firecracker-microvm#3445)

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Good first issue Indicates a good issue for first-time contributors Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use scatter-gather IO in Firecracker virtio-net TX path
7 participants