Skip to content

Conversation

kevinhartman
Copy link
Contributor

@kevinhartman kevinhartman commented Nov 7, 2023

Summary

This PR includes:

  • A new method replace_bits on the Rust type CircuitData (introduced in Oxidize QuantumCircuit._data and intern CircuitInstruction args #10827) that takes advantage of CircuitData's interned representation of qubits and clbits as vectors of indices to provide circuit remapping in constant time with respect to the number of existing instructions.
  • Three new related methods, CircuitData::map_ops, CircuitData::foreach_op and CircuitData::enumerate_ops that are used to invoke a Python callable for each operation in the circuit without constructing ephemeral CircuitInstruction instances. These methods provide a considerable performance boost for client code that was previously iterating over CircuitData but ignoring the qubits and clbits of yielded CircuitInstructions.
  • Updates to ControlFlowBuilderBlock's instruction storage, which now uses CircuitData instead of list. This also includes a new method CircuitScopeInterface.extend which is used by QuantumCircuit.compose to update the currently scoped ControlFlowBuilderBlock all in one go.

Details and comments

CircuitData::replace_bits

In its current form, CircuitData owns the qubits and clbits of a QuantumCircuit. As instructions are added to it, these qubits and clbits sequences are used to convert the instruction's bits from Qubit and Clbit instances to the corresponding indices, which are then stored with the InternContext. On access (e.g. via lookup or iteration), CircuitData retrieves indices from the InternContext and performs a lookup in the opposite direction to get back to Qubit and Clbit instances. Because we do this lookup anyway, replace_bits gives us a way to perform circuit remapping in constant time by simply swapping-out a CircuitData's clbits and qubits.

This is used to accelerate QuantumCircuit.compose and circuit_to_instruction.

CircuitData::map_ops, CircuitData::foreach_opandCircuitData::enumerate_ops`

In several cases, we don't actually need to access the qubits and clbits of CircuitInstruction instances while iterating over a CircuitData. For example, when updating a ParameterTable. And now with the introduction of replace_bits, we also have a use case for these functions to update the operation on conditional instructions during bit remapping.

CircuitScopeInterface.extend

This new method takes a CircuitData instance and uses it to extend the block's instruction listing all in one go. Because blocks now store their instructions in a CircuitData, a fast path has been added to CircuitData::extend that avoids construction of intermediate CircuitInstruction instances which would simply be thrown away anyway.

Todo

  • Add comment with performance graphs comparing 0.45, oxidize-qc-data, and this branch, oxidize-qc-data-perf.
  • Unit testing for new methods.
  • Minor refactoring to make CircuitData::replace_bits cleaner.
  • Update doc strings to be in Python/Sphinx form for pyclass methods.
  • Rebase and publish.

@coveralls
Copy link

coveralls commented Nov 7, 2023

Pull Request Test Coverage Report for Build 7412912864

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 87.548%

Totals Coverage Status
Change from base Build 7412531238: 0.05%
Covered Lines: 59454
Relevant Lines: 67910

💛 - Coveralls

@mtreinish mtreinish added on hold Can not fix yet performance Rust This PR or issue is related to Rust code in the repository labels Nov 13, 2023
@kevinhartman
Copy link
Contributor Author

kevinhartman commented Nov 13, 2023

Here are some updated benchmarks for circuit building comparing memory pressure and runtime between 0.45.0, the original oxidize-qc-data branch, and oxidize-qc-data-perf (this PR branch).

The graphs show a sweep from depth=0 to depth=2000. Memory usage between oxidize-qc-data and oxidize-qc-data-perf is mostly the same, while performance is dramatically improved.

2Local 127Q, flatten=False

When flatten is False, the optimizations to QuantumCircuit.compose done in this PR improve circuit building performance even beyond what we currently get in 0.45.0.

At the upper end (depth=2000), 0.45 takes ~119s, while oxidize-qc-data-perf takes ~99s, a ~20s speedup.

qc = TwoLocal(
    127,
    flatten=False,
    rotation_blocks=[SXGate(), "rz", SXGate()],
    entanglement_blocks=ECRGate(),
    entanglement='pairwise',
    reps=int(sys.argv[1])
)
qc._build()

2Local 127Q, flatten=True

When flatten is True, we're still slower than 0.45.0, but we do quite a bit better than the original oxidize-qc-data branch.

At the upper end (depth=2000), 0.45 takes ~70s, while oxidize-qc-data-perf takes ~79s, a ~9 second slowdown.

qc = TwoLocal(
    127,
    flatten=True,
    rotation_blocks=[SXGate(), "rz", SXGate()],
    entanglement_blocks=ECRGate(),
    entanglement='pairwise',
    reps=int(sys.argv[1])
)
qc._build()

@kevinhartman
Copy link
Contributor Author

To demonstrate the impact of these changes on transpilation time relative to 0.45 and oxidize-qc-data, here's a graph resulting from building and transpiling the following QV 127 Q circuit at depths sweeping from 0 to 1000, with a step size of 200.

We wouldn't expect much difference in transpilation time anyway since we convert to a DAGCircuit in all versions, but I thought I'd include this to validate that assumption.

import sys
from qiskit import QuantumCircuit, transpile
from qiskit.circuit.library import QuantumVolume
from qiskit.providers.fake_provider import FakeWashington


backend = FakeWashington()

qv = QuantumVolume(127, int(sys.argv[1]), 1337)
qv.measure_all()

tqc = transpile(
    qv,
    backend,
    seed_transpiler=1334,
    layout_method="sabre",
    routing_method="sabre",
    optimization_level=0
)

qv_transpile

Use CircuitData in builder.py.

Add fastpath for CircuitData::extend(CircuitData).

Fix merge conflicts.

Add docstrings and reorder functions for an improved diff.

Optimize QuantumCircuit.copy for CircuitData.

Run formatting after merge.

Update docstrings.

Improve doc comment for replace_bits.

Fix lint.

Optimize copy.

Fix merge.

Refactor replace_bits.

Run formatter.

Use Python form docstrings in API.

Fix lint.
@kevinhartman kevinhartman marked this pull request as ready for review November 21, 2023 04:14
@kevinhartman kevinhartman requested review from woodsp-ibm, ikkoham and a team as code owners November 21, 2023 04:14
@qiskit-bot
Copy link
Collaborator

One or more of the the following people are requested to review this:

  • @Eric-Arellano
  • @Cryoris
  • @Qiskit/terra-core
  • @kevinhartman
  • @mtreinish
  • @woodsp-ibm

@kevinhartman kevinhartman removed the on hold Can not fix yet label Nov 22, 2023
@1ucian0 1ucian0 self-assigned this Nov 27, 2023
Copy link
Member

@1ucian0 1ucian0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python part LGTM. But probably another pair of eyes would be better

@1ucian0 1ucian0 removed their assignment Nov 27, 2023
Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this looks sensible to me. If we're bikeshedding names, I'd perhaps suggest map_ops instead of replace_ops, just to match more with the standard Python and Rust terminology for that operation, but it's not very important.

Could you show the performance characteristics of the Pauli.to_instruction and QuantumCircuit.copy asv benchmarks between the three reference points as well, just to check everything? The changeset here looks strongly like you've directly targetted those benchmarks, so I'm assuming you already have that data somewhere?

This simplifies caller logic when appending an
instruction listing to a scope by leaving it up
to the scope implementation to determine what is
required (e.g. updating the parameter table only
for QuantumCircuit scopes vs just tracking bits
and instructions for a control flow block.
@kevinhartman
Copy link
Contributor Author

@jakelishman:

Could you show the performance characteristics of the Pauli.to_instruction and QuantumCircuit.copy asv benchmarks between the three reference points as well, just to check everything?

main(@181208) vs PR

This one should be mostly better, given that the original PR #10827 is already in main and this PR optimizes that work. We get this speedup to main by merging this PR (which we absolutely should do).

> asv compare -s 18120853a3a4f1cab461119e9643eec6029ea568 oxidize-qc-data-perf

Benchmarks that have improved:

| Change   | Before [18120853] <oxidize-qc-data-perf~4^2>   | After [c7ecf358] <oxidize-qc-data-perf>   |   Ratio | Benchmark (Parameter)                                                       |
|----------|------------------------------------------------|-------------------------------------------|---------|-----------------------------------------------------------------------------|
| -        | 217±4μs                                        | 40.1±0.3μs                                |    0.18 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 128)     |
| -        | 199±2ms                                        | 23.5±0.3ms                                |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 131072)  |
| -        | 3.23±0.05ms                                    | 382±30μs                                  |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 2048)    |
| -        | 49.3±0.1ms                                     | 5.89±0.05ms                               |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 32768)   |
| -        | 33.9±1μs                                       | 26.4±2μs                                  |    0.78 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 8)       |
| -        | 12.8±0.2ms                                     | 1.51±0.04ms                               |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 8192)    |
| -        | 244±2μs                                        | 52.4±1μs                                  |    0.21 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 128)    |
| -        | 210±9ms                                        | 24.0±0.3ms                                |    0.11 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 131072) |
| -        | 4.87±0.1ms                                     | 390±3μs                                   |    0.08 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 2048)   |
| -        | 52.8±0.6ms                                     | 5.96±0.04ms                               |    0.11 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 32768)  |
| -        | 73.7±0.7μs                                     | 30.7±2μs                                  |    0.42 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 8)      |
| -        | 13.2±0.3ms                                     | 1.50±0.02ms                               |    0.11 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 8192)   |
| -        | 225±8μs                                        | 43.5±3μs                                  |    0.19 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 128)     |
| -        | 210±5ms                                        | 25.0±0.6ms                                |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 131072)  |
| -        | 3.27±0.02ms                                    | 384±5μs                                   |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 2048)    |
| -        | 51.3±0.6ms                                     | 6.14±0.5ms                                |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 32768)   |
| -        | 36.1±0.8μs                                     | 19.0±0.6μs                                |    0.53 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 8)       |
| -        | 12.8±0.2ms                                     | 1.48±0.02ms                               |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 8192)    |
| -        | 299±20μs                                       | 60.3±0.6μs                                |    0.2  | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 128)    |
| -        | 208±4ms                                        | 24.4±3ms                                  |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 131072) |
| -        | 3.32±0.08ms                                    | 399±5μs                                   |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 2048)   |
| -        | 53.2±1ms                                       | 5.97±0.1ms                                |    0.11 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 32768)  |
| -        | 96.0±2μs                                       | 39.7±1μs                                  |    0.41 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 8)      |
| -        | 13.0±0.2ms                                     | 1.50±0.02ms                               |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 8192)   |
| -        | 242±4μs                                        | 45.9±1μs                                  |    0.19 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 128)     |
| -        | 211±7ms                                        | 25.4±0.7ms                                |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 131072)  |
| -        | 3.22±0.06ms                                    | 403±10μs                                  |    0.13 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 2048)    |
| -        | 51.4±0.9ms                                     | 6.11±0.05ms                               |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 32768)   |
| -        | 38.7±0.8μs                                     | 23.4±3μs                                  |    0.6  | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 8)       |
| -        | 12.8±0.2ms                                     | 1.54±0.02ms                               |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 8192)    |
| -        | 249±3μs                                        | 58.6±10μs                                 |    0.24 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 128)     |
| -        | 213±4ms                                        | 24.4±0.1ms                                |    0.11 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 131072)  |
| -        | 3.25±0.05ms                                    | 393±4μs                                   |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 2048)    |
| -        | 55.8±2ms                                       | 6.06±0.1ms                                |    0.11 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 32768)   |
| -        | 51.6±1μs                                       | 24.8±0.3μs                                |    0.48 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 8)       |
| -        | 13.0±0.1ms                                     | 1.51±0.04ms                               |    0.12 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 8192)    |
| -        | 169±4μs                                        | 116±2μs                                   |    0.69 | converters.ConverterBenchmarks.time_circuit_to_instruction(1, 8)            |
| -        | 276±50μs                                       | 138±3μs                                   |    0.5  | converters.ConverterBenchmarks.time_circuit_to_instruction(2, 8)            |

Benchmarks that have stayed the same:

| Change   | Before [18120853] <oxidize-qc-data-perf~4^2>   | After [c7ecf358] <oxidize-qc-data-perf>   | Ratio   | Benchmark (Parameter)                                                |
|----------|------------------------------------------------|-------------------------------------------|---------|----------------------------------------------------------------------|
|          | 1.44±0.7ms                                     | 1.58±0.02ms                               | 1.09    | quantum_info.PauliBench.time_to_instruction(100)                     |
|          | 6.41±2ms                                       | 7.44±3ms                                  | ~1.16   | quantum_info.PauliBench.time_to_instruction(300)                     |
|          | 10.2±0.1ms                                     | 5.73±6ms                                  | ~0.56   | quantum_info.PauliBench.time_to_instruction(400)                     |
|          | 15.0±8ms                                       | 8.97±9ms                                  | ~0.60   | quantum_info.PauliBench.time_to_instruction(500)                     |

Benchmarks that have got worse:

| Change   | Before [18120853] <oxidize-qc-data-perf~4^2>   | After [c7ecf358] <oxidize-qc-data-perf>   |   Ratio | Benchmark (Parameter)                            |
|----------|------------------------------------------------|-------------------------------------------|---------|--------------------------------------------------|
| +        | 3.54±0.06ms                                    | 3.92±1ms                                  |    1.11 | quantum_info.PauliBench.time_to_instruction(200) |

stable/0.45 (@908e55) vs PR

This compares the pre-oxidization (just Python list) QuantumCircuit.data to oxidized version with optimizations in place. It seems to indicate that circuit copying is faster than it was before the memory optimizations, but that Pauli.to_instruction is often slower.

> asv compare -s upstream/stable/0.45 oxidize-qc-data-perf

Benchmarks that have improved:

| Change   | Before [908e555d]    | After [c7ecf358] <oxidize-qc-data-perf>   |   Ratio | Benchmark (Parameter)                                                       |
|----------|----------------------|-------------------------------------------|---------|-----------------------------------------------------------------------------|
| -        | 108±0.7μs            | 40.1±0.3μs                                |    0.37 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 128)     |
| -        | 94.7±0.9ms           | 23.5±0.3ms                                |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 131072)  |
| -        | 1.49±0.02ms          | 382±30μs                                  |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 2048)    |
| -        | 23.6±2ms             | 5.89±0.05ms                               |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 32768)   |
| -        | 5.80±0.1ms           | 1.51±0.04ms                               |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 8192)    |
| -        | 118±4μs              | 52.4±1μs                                  |    0.45 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 128)    |
| -        | 96.8±0.7ms           | 24.0±0.3ms                                |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 131072) |
| -        | 1.49±0.02ms          | 390±3μs                                   |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 2048)   |
| -        | 24.1±0.7ms           | 5.96±0.04ms                               |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 32768)  |
| -        | 5.84±0.07ms          | 1.50±0.02ms                               |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 8192)   |
| -        | 119±10μs             | 43.5±3μs                                  |    0.36 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 128)     |
| -        | 99.1±2ms             | 25.0±0.6ms                                |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 131072)  |
| -        | 1.65±0.1ms           | 384±5μs                                   |    0.23 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 2048)    |
| -        | 24.6±0.3ms           | 6.14±0.5ms                                |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 32768)   |
| -        | 5.92±0.03ms          | 1.48±0.02ms                               |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 8192)    |
| -        | 127±1μs              | 60.3±0.6μs                                |    0.48 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 128)    |
| -        | 98.3±2ms             | 24.4±3ms                                  |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 131072) |
| -        | 1.50±0.02ms          | 399±5μs                                   |    0.27 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 2048)   |
| -        | 24.4±0.5ms           | 5.97±0.1ms                                |    0.24 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 32768)  |
| -        | 5.87±0.5ms           | 1.50±0.02ms                               |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 8192)   |
| -        | 126±10μs             | 45.9±1μs                                  |    0.36 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 128)     |
| -        | 98.1±0.9ms           | 25.4±0.7ms                                |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 131072)  |
| -        | 1.50±0.02ms          | 403±10μs                                  |    0.27 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 2048)    |
| -        | 23.9±0.3ms           | 6.11±0.05ms                               |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 32768)   |
| -        | 5.90±0.03ms          | 1.54±0.02ms                               |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 8192)    |
| -        | 115±2μs              | 58.6±10μs                                 |    0.51 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 128)     |
| -        | 99.4±2ms             | 24.4±0.1ms                                |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 131072)  |
| -        | 1.48±0.03ms          | 393±4μs                                   |    0.27 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 2048)    |
| -        | 24.1±0.4ms           | 6.06±0.1ms                                |    0.25 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 32768)   |
| -        | 5.87±0.1ms           | 1.51±0.04ms                               |    0.26 | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 8192)    |
| -        | 135±8μs              | 116±2μs                                   |    0.86 | converters.ConverterBenchmarks.time_circuit_to_instruction(1, 8)            |
| -        | 175±3μs              | 138±3μs                                   |    0.79 | converters.ConverterBenchmarks.time_circuit_to_instruction(2, 8)            |

Benchmarks that have stayed the same:

| Change   | Before [908e555d]    | After [c7ecf358] <oxidize-qc-data-perf>   | Ratio   | Benchmark (Parameter)                                                  |
|----------|----------------------|-------------------------------------------|---------|------------------------------------------------------------------------|
|          | 33.4±0.3μs           | 30.7±2μs                                  | 0.92    | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 8) |
|          | 19.8±0.5μs           | 19.0±0.6μs                                | 0.96    | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 8)  |
|          | 42.0±0.4μs           | 39.7±1μs                                  | 0.94    | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 8) |
|          | 23.8±0.3μs           | 24.8±0.3μs                                | 1.04    | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 8)  |
|          | 724±300μs            | 1.58±0.02ms                               | ~2.18   | quantum_info.PauliBench.time_to_instruction(100)                       |
|          | 1.79±0.03ms          | 7.44±3ms                                  | ~4.15   | quantum_info.PauliBench.time_to_instruction(300)                       |
|          | 2.25±1ms             | 5.73±6ms                                  | ~2.54   | quantum_info.PauliBench.time_to_instruction(400)                       |
|          | 2.84±0.1ms           | 8.97±9ms                                  | ~3.16   | quantum_info.PauliBench.time_to_instruction(500)                       |

Benchmarks that have got worse:

| Change   | Before [908e555d]    | After [c7ecf358] <oxidize-qc-data-perf>   |   Ratio | Benchmark (Parameter)                                                 |
|----------|----------------------|-------------------------------------------|---------|-----------------------------------------------------------------------|
| +        | 19.4±0.5μs           | 26.4±2μs                                  |    1.36 | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 8) |
| +        | 19.0±0.4μs           | 23.4±3μs                                  |    1.23 | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 8) |
| +        | 1.34±0.05ms          | 3.92±1ms                                  |    2.94 | quantum_info.PauliBench.time_to_instruction(200)                      |

@kevinhartman
Copy link
Contributor Author

I've also run benchmarks for circuit_to_instruction between stable/0.45 (908e555) and the PR branch, which show that this method is faster and scales better than before any of the QuantumCircuit.data oxidation work 😄:

Benchmarks that have improved:

| Change   | Before [908e555d]    | After [a2f2fbd7] <oxidize-qc-data-perf>   |   Ratio | Benchmark (Parameter)                                                |
|----------|----------------------|-------------------------------------------|---------|----------------------------------------------------------------------|
| -        | 132±4μs              | 120±1μs                                   |    0.91 | converters.ConverterBenchmarks.time_circuit_to_instruction(1, 8)     |
| -        | 174±20μs             | 141±2μs                                   |    0.81 | converters.ConverterBenchmarks.time_circuit_to_instruction(2, 8)     |
| -        | 270±4μs              | 201±3μs                                   |    0.75 | converters.ConverterBenchmarks.time_circuit_to_instruction(5, 8)     |
| -        | 357±6μs              | 252±5μs                                   |    0.71 | converters.ConverterBenchmarks.time_circuit_to_instruction(8, 8)     |
| -        | 1.77±0.04ms          | 1.22±0.06ms                               |    0.69 | converters.ConverterBenchmarks.time_circuit_to_instruction(53, 8)    |
| -        | 538±6μs              | 362±30μs                                  |    0.67 | converters.ConverterBenchmarks.time_circuit_to_instruction(14, 8)    |
| -        | 721±8μs              | 482±6μs                                   |    0.67 | converters.ConverterBenchmarks.time_circuit_to_instruction(20, 8)    |
| -        | 1.13±0.03ms          | 697±9μs                                   |    0.62 | converters.ConverterBenchmarks.time_circuit_to_instruction(32, 8)    |
| -        | 720±7μs              | 415±90μs                                  |    0.58 | converters.ConverterBenchmarks.time_circuit_to_instruction(1, 128)   |
| -        | 1.04±0.02ms          | 563±20μs                                  |    0.54 | converters.ConverterBenchmarks.time_circuit_to_instruction(2, 128)   |
| -        | 14.9±0.4ms           | 7.38±0.1ms                                |    0.5  | converters.ConverterBenchmarks.time_circuit_to_instruction(2, 2048)  |
| -        | 9.14±0.1ms           | 4.41±0.2ms                                |    0.48 | converters.ConverterBenchmarks.time_circuit_to_instruction(1, 2048)  |
| -        | 61.0±1ms             | 29.2±0.3ms                                |    0.48 | converters.ConverterBenchmarks.time_circuit_to_instruction(2, 8192)  |
| -        | 36.1±0.5ms           | 16.5±0.3ms                                |    0.46 | converters.ConverterBenchmarks.time_circuit_to_instruction(1, 8192)  |
| -        | 2.00±0.03ms          | 916±20μs                                  |    0.46 | converters.ConverterBenchmarks.time_circuit_to_instruction(5, 128)   |
| -        | 2.99±0.04ms          | 1.27±0.06ms                               |    0.43 | converters.ConverterBenchmarks.time_circuit_to_instruction(8, 128)   |
| -        | 10.6±0.5ms           | 4.41±0.1ms                                |    0.42 | converters.ConverterBenchmarks.time_circuit_to_instruction(32, 128)  |
| -        | 4.95±0.06ms          | 2.02±0.02ms                               |    0.41 | converters.ConverterBenchmarks.time_circuit_to_instruction(14, 128)  |
| -        | 28.5±0.5ms           | 11.6±0.2ms                                |    0.41 | converters.ConverterBenchmarks.time_circuit_to_instruction(5, 2048)  |
| -        | 77.1±2ms             | 30.7±0.4ms                                |    0.4  | converters.ConverterBenchmarks.time_circuit_to_instruction(14, 2048) |
| -        | 6.92±0.05ms          | 2.80±0.1ms                                |    0.4  | converters.ConverterBenchmarks.time_circuit_to_instruction(20, 128)  |
| -        | 126±3ms              | 50.5±1ms                                  |    0.4  | converters.ConverterBenchmarks.time_circuit_to_instruction(5, 8192)  |
| -        | 17.3±0.5ms           | 6.87±0.05ms                               |    0.4  | converters.ConverterBenchmarks.time_circuit_to_instruction(53, 128)  |
| -        | 44.5±2ms             | 17.7±0.9ms                                |    0.4  | converters.ConverterBenchmarks.time_circuit_to_instruction(8, 2048)  |

Benchmarks that have stayed the same:

| Change   | Before [908e555d]    | After [a2f2fbd7] <oxidize-qc-data-perf>   | Ratio   | Benchmark (Parameter)                                                |
|----------|----------------------|-------------------------------------------|---------|----------------------------------------------------------------------|
|          | 2.87±0.09ms          | 17.4±9ms                                  | ~6.05   | quantum_info.PauliBench.time_to_instruction(500)                     |
|          | 2.31±0.9ms           | 12.3±5ms                                  | ~5.35   | quantum_info.PauliBench.time_to_instruction(400)                     |
|          | 1.79±0.02ms          | 7.62±3ms                                  | ~4.25   | quantum_info.PauliBench.time_to_instruction(300)                     |
|          | 1.25±0.5ms           | 3.92±2ms                                  | ~3.14   | quantum_info.PauliBench.time_to_instruction(200)                     |
|          | 186±6ms              | 72.2±1ms                                  | ~0.39   | converters.ConverterBenchmarks.time_circuit_to_instruction(8, 8192)  |
|          | 741±300μs            | 43.2±800μs                                | ~0.06   | quantum_info.PauliBench.time_to_instruction(100)                     |
|          | n/a                  | n/a                                       | n/a     | converters.ConverterBenchmarks.time_circuit_to_instruction(14, 8192) |
|          | n/a                  | n/a                                       | n/a     | converters.ConverterBenchmarks.time_circuit_to_instruction(20, 2048) |
|          | n/a                  | n/a                                       | n/a     | converters.ConverterBenchmarks.time_circuit_to_instruction(20, 8192) |
|          | n/a                  | n/a                                       | n/a     | converters.ConverterBenchmarks.time_circuit_to_instruction(32, 2048) |
|          | n/a                  | n/a                                       | n/a     | converters.ConverterBenchmarks.time_circuit_to_instruction(32, 8192) |
|          | n/a                  | n/a                                       | n/a     | converters.ConverterBenchmarks.time_circuit_to_instruction(53, 2048) |
|          | n/a                  | n/a                                       | n/a     | converters.ConverterBenchmarks.time_circuit_to_instruction(53, 8192) |

Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is very close to being ready to land - the performance improvements certainly look great!

Aside from two fairly minor comments here, I guess there's a little bit of merge conflict that'll need fixing, and there's a couple of not-yet-resolved comment chains I replied to in the previous review:

Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the changes on this! All the comments I had are resolved, so I think we should merge this and start getting tested in general usage (at least on the main branch!).

@jakelishman jakelishman added this pull request to the merge queue Jan 4, 2024
Merged via the queue into Qiskit:main with commit 3cc948f Jan 4, 2024
ShellyGarion pushed a commit to ShellyGarion/qiskit-terra that referenced this pull request Jan 18, 2024
* Optimize QC compose and circuit_to_instruction.

Use CircuitData in builder.py.

Add fastpath for CircuitData::extend(CircuitData).

Fix merge conflicts.

Add docstrings and reorder functions for an improved diff.

Optimize QuantumCircuit.copy for CircuitData.

Run formatting after merge.

Update docstrings.

Improve doc comment for replace_bits.

Fix lint.

Optimize copy.

Fix merge.

Refactor replace_bits.

Run formatter.

Use Python form docstrings in API.

Fix lint.

* Update comment.

* Add unit tests.

* Make active_bits stable and preallocate.

* Improve readability of add_{qu,cl}bit methods.

* Clarify replace_bits docstring.

* Remove unnecessary type hint.

* Make ControlFlowBuilderBlock qubits and clbits methods.

* Remove TODO.

* Use single replace_ops in QuantumCircuit.copy.

* Use memoization in circuit_to_instruction. Style fixes.

* Add CircuitScopeInterface.extend.

This simplifies caller logic when appending an
instruction listing to a scope by leaving it up
to the scope implementation to determine what is
required (e.g. updating the parameter table only
for QuantumCircuit scopes vs just tracking bits
and instructions for a control flow block.

* Simplify logic for front compose.

* Add enumerate_ops, optimize block builder.

* Rename replace_ops to map_ops.

* Fix benchmark random_circuit.

* Add default strict mode for bits.

* Return tuple of sets from active_bits.

* Rename 'enumerate_ops' to 'foreach_op_indexed'.

* Add _track_operation method.

* Fix merge for global phase.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance priority: high Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants