Skip to content

Add new multithreaded TwoQubitPeepholeOptimization pass #13419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 46 commits into
base: main
Choose a base branch
from

Conversation

mtreinish
Copy link
Member

@mtreinish mtreinish commented Nov 10, 2024

Summary

This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substitution. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see #11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers

Right now this commit has a number of limitations, the largest are:

  • Only supports the target
  • It doesn't support the XX decomposer because it's not in rust (the TwoQubitBasisDecomposer and TwoQubitControlledUDecomposer are used)

This pass doesn't support using the unitary synthesis plugin interface, since
it's optimized to use Qiskit's built-in two qubit synthesis routines written in
Rust. The existing combination of ConsolidateBlocks and UnitarySynthesis
should be used instead if the plugin interface is necessary.

Details and comments

Fixes #12007
Fixes #11659

TODO:

@mtreinish mtreinish added performance Changelog: New Feature Include in the "Added" section of the changelog Rust This PR or issue is related to Rust code in the repository mod: transpiler Issues and PRs related to Transpiler labels Nov 10, 2024
@mtreinish mtreinish added this to the 2.0.0 milestone Nov 10, 2024
@coveralls
Copy link

coveralls commented Nov 10, 2024

Pull Request Test Coverage Report for Build 15654439601

Details

  • 576 of 611 (94.27%) changed or added relevant lines in 12 files are covered.
  • 22 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.02%) to 88.02%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/synthesis/src/two_qubit_decompose.rs 16 17 94.12%
crates/transpiler/src/passes/unitary_synthesis.rs 12 13 92.31%
crates/transpiler/src/passes/two_qubit_unitary_synthesis_utils.rs 133 139 95.68%
crates/transpiler/src/passes/two_qubit_peephole.rs 376 403 93.3%
Files with Coverage Reduction New Missed Lines %
crates/qasm2/src/lex.rs 2 92.48%
crates/transpiler/src/passes/unitary_synthesis.rs 2 94.33%
crates/circuit/src/symbol_expr.rs 6 73.69%
crates/qasm2/src/parse.rs 12 96.68%
Totals Coverage Status
Change from base Build 15635869582: 0.02%
Covered Lines: 83570
Relevant Lines: 94944

💛 - Coveralls

This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substituion. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see Qiskit#11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers

Right now this commit has a number of limitations, the largest are:

- Only supports the target
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer,
  because it's the only one in rust currently.

For plugin handling I left the logic as running the three pass series,
but I'm not sure this is the behavior we want. We could say keep the
synthesis plugins for `UnitarySynthesis` only and then rely on our
built-in methods for physical optimiztion only. But this also seems less
than ideal because the plugin mechanism is how we support synthesizing
to custom basis gates, and also more advanced approximate synthesis
methods. Both of those are things we need to do as part of the synthesis
here.

Additionally, this is currently missing tests and documentation and while
running it manually "works" as in it returns a circuit that looks valid,
I've not done any validation yet. This also likely will need several
rounds of performance optimization and tuning. t this point this is
just a rough proof of concept and will need a lof refinement along with
larger changes to Qiskit's rust code before this is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
Since Qiskit#13139 merged we have another two qubit decomposer available to
run in rust, the TwoQubitControlledUDecomposer. This commit updates the
new TwoQubitPeepholeOptimization to call this decomposer if the target
supports appropriate 2q gates.
Clippy is correctly warning that the size difference between the two
decomposer types in the TwoQubitDecomposer enumese two types is large.
TwoQubitBasisDecomposer is 1640 bytes and TwoQubitControlledUDecomposer
is only 24 bytes. This means each element of ControlledU is wasting
> 1600 bytes. However, in this case that is acceptable in order to
avoid a layer of pointer indirection as these are stored temporarily
in a vec inside a thread to decompose a unitary. A trait would be more
natural for this to define a common interface between all the two qubit
decomposers but since we keep them instantiated for each edge in a Vec
they need to be sized and doing something like
`Box<dyn TwoQubitDecomposer>` (assuming a trait `TwoQubitDecomposer`
instead of a enum) to get around this would have additional runtime
overhead. This is also considering that TwoQubitControlledUDecomposer
has far less likelihood in practice as it only works with some targets
that have RZZ, RXX, RYY, or RZX gates on an edge which is less common.
Also don't run scoring more than needed.
@ShellyGarion
Copy link
Member

Copy here the comment of @t-imamichi #13568 (comment)
and my reply: #13568 (comment)

I think this closes #13428. How about adding a test case of consecutive RZZ (RXX, and RYY) gates?

We should make sure that after PR #13568 and this PR will be merged, we can efficiently transpile circuits into basis fractional RZZ gates .

@mtreinish
Copy link
Member Author

I added support for using the ControlledUDecomposer to the new pass back in early December with this commit: 746758f although looking at that now with fresh eyes I need to check that the gate is continuous in the target, right now it only looks at the supported gate types.

@ShellyGarion ShellyGarion self-assigned this Feb 3, 2025
This commit updates some of the pass's logic and handles some bugs that
were found during development. It also adds a couple new test cases, one
of which is failing. There still seem to be bugs in the case we need to
use a reverse edge in the target.
mtreinish and others added 6 commits June 10, 2025 10:17
This commit deduplicates the code paths between the unitary synthesis
code and the two qubit peephole pass. The two passes both have the job
of taking a unitary matrix and synthesizing it to a given gate sequence
in roughly the same way. The context both passes are run is slightly
different, as unitary synthesis is more general and two qubit peephole
optimization is specifically only on physical circuits and also only
operates on 4x4 matrices. But we didn't need to duplicate the code that
goes from a matrix and decomposer to a sequence. This commit moves the
code from unitary synthesis into a common module and then updates
the peephole pass to use that instead of duplicating the implementation.
After the previous commit deduplicated the synthesis path to fix the
directionality bug and also simplify the shared code there was one issue
introduced. The peephole pass is target only and was previously using
the target to store details about custom operations when a sentinel
value was found in the decomposition gate sequence. However unitary
synthesis can work in the absence of a target and was handling this
manually. This caused a mismatch in expectations both the peephole pass
was no using the correct name for target lookups in some cases and
during applying the synthesis custom gates in the target would not be
inserted correctly.

Ideally this will al lbe cleaned up after Qiskit#14418 is fixed since we won't
have sentinel values to represent non-CX gates anymore in the two qubit
decomposers.
This commit expands the test coverage to ensure that single gate
two qubit gate synthesis matches expectations with different basis.

Co-authored-by: Shelly Garion <46566946+ShellyGarion@users.noreply.github.com>
@mtreinish mtreinish changed the title [WIP] Add new multithreaded TwoQubitPeepholeOptimization pass Add new multithreaded TwoQubitPeepholeOptimization pass Jun 14, 2025
@mtreinish mtreinish removed their assignment Jun 14, 2025
@mtreinish mtreinish marked this pull request as ready for review June 14, 2025 17:14
@mtreinish mtreinish requested a review from a team as a code owner June 14, 2025 17:14
@qiskit-bot
Copy link
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core
  • @levbishop

Copy link
Member

@ShellyGarion ShellyGarion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an extremely important PR. Do you have some benchmarks for the runtime?

There are a few issues regarding synthesis with RZZ/RXX gates, does this PR fix them? If so, we need to include some tests:
#14350
#13431
#13428 (comment) (this should already have a test)

I also have a few comments and questions.
Should this pass be added as default for optimilzation_level=2 and 3? instead of UnitarySynthesis and ConsolidateBlocks?

StandardGate::RXX,
StandardGate::RZZ,
StandardGate::RYY,
StandardGate::RZX,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are more standard gates that are equivalent here:
CPhase, CRX, CRY, CRZ, CU

unoptimized.cx(1, 0)

# Generate a target with random error rates
target = GenericBackendV2(2, ["u", "cx"], coupling_map=[0, 1]).target
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps it's worth to add an example with the new target containing CZ+RZZ gates

};

fn get_decomposers_from_target(
target: &Target,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps add a docstring explaining this function?

OperationRef::StandardGate(gate) => {
if matches!(
gate,
StandardGate::CX | StandardGate::CZ | StandardGate::ECR
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that perhaps it's better to have these in some enum, in case we'll have more kak gates later on

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also iswap should be added here?

@@ -511,6 +494,7 @@ fn get_2q_decomposer_from_basis(
decomposer: DecomposerType::TwoQubitControlledU(Box::new(decomposer)),
packed_op: PackedOperation::from_standard_gate(std_gate),
params: SmallVec::new(),
target_name: kak_gates[0].to_string(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we still use the hard coded kak gates? I also think we should add CU to the list of param_basis_names.

# copyright notice, and modified files need to carry a notice indicating
# that they have been altered from the originals.

"""Splits each two-qubit gate in the `dag` into two single-qubit gates, if possible without error."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment still relevant here?

# pylint: disable=missing-function-docstring

"""
Tests for the default UnitarySynthesis transpiler pass.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnitarySynthesis --> TwoQubitPeepholeOptimization

),
name="bidirectional_{bidirectional}",
)
def test_coupling_map_transpile_with_backendv2(self, bidirectional):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests should have a docstring

@mtreinish
Copy link
Member Author

This is an extremely important PR. Do you have some benchmarks for the runtime?

I still need to do benchmarks, I think there are probably some performance optimizations available too.

There are a few issues regarding synthesis with RZZ/RXX gates, does this PR fix them? If so, we need to include some tests:
#14350
#13431
#13428 (comment) (this should already have a test)

I'll look at adding tests for these cases.

I also have a few comments and questions.
Should this pass be added as default for optimilzation_level=2 and 3? instead of UnitarySynthesis and ConsolidateBlocks?

We should do this in a follow up PR where we can look at that impact on the preset pass manager construction in isolation. Getting the pass right functional on its own is tricky enough as it is and I don't want to grow the scope of this PR even more. My plan though is to replace the use of consolidate blocks and unitary synthesis in the optimization stage for level 2 and 3 with this pass.

@mtreinish
Copy link
Member Author

mtreinish commented Jun 16, 2025

I ran some benchmarks using:

import time
import csv
import numpy as np

from qiskit.transpiler.passes.optimization.two_qubit_peephole import TwoQubitPeepholeOptimization
from qiskit.providers.fake_provider import GenericBackendV2
from qiskit.transpiler.coupling import CouplingMap
from qiskit.circuit.random import random_circuit
from qiskit.transpiler import PassManager
from qiskit.transpiler.passes import Collect2qBlocks, ConsolidateBlocks, UnitarySynthesis
from qiskit import transpile

cmap = CouplingMap.from_grid(5, 5)
backend = GenericBackendV2(cmap.size(), seed=2024, coupling_map=cmap)
peephole = TwoQubitPeepholeOptimization(backend.target)

legacy_path = PassManager(
    [
        ConsolidateBlocks(
            target=backend.target
        ),
        UnitarySynthesis(
            target=backend.target,
        ),
    ]
)


with open("peephole.csv", "w") as peep:
    peephole_writer = csv.writer(peep)
    peephole_writer.writerow(["input_depth", "output_depth", "output_2q_count", "output_total_count", "run_time"])
    with open("legacy.csv", "w") as legacy:
        legacy_writer = csv.writer(legacy)
        legacy_writer.writerow(["input_depth", "output_depth", "output_2q_count", "output_total_count", "run_time"])
        for depth in np.logspace(1, 6, dtype=int):
            qc = random_circuit(backend.num_qubits, depth, max_operands=2)
            qc.measure_all()
            qc = transpile(qc, backend, optimization_level=0, seed_transpiler=2025)
            input_depth = qc.depth()
            print(qc.count_ops())
            start_time = time.perf_counter()
            res = peephole(qc)
            stop_time = time.perf_counter()
            print(f"new pass parallel: {stop_time - start_time}")
            print(res.count_ops())
            peephole_writer.writerow([input_depth, res.depth(), res.size(lambda x: x.operation.num_qubits == 2), res.size(), stop_time - start_time])
            peep.flush()
            start_time = time.perf_counter()
            res = legacy_path.run(qc)
            stop_time = time.perf_counter()
            print(f"legacy path: {stop_time - start_time}")
            print(res.count_ops())
            legacy_writer.writerow([input_depth, res.depth(), res.size(lambda x: x.operation.num_qubits == 2), res.size(), stop_time - start_time])
            legacy.flush()

The resulting runtime graph looks good (but could be better) so I'll try to do some tuning:

peephole_runtime

what's a bit more concerning is the total gate count (ignore the y axis it's gate count not runtime):

peephole_gates

the 2q gate counts are nearly always the same (or sometimes slightly less) but there are a lot more single qubits gates.

@t-imamichi
Copy link
Member

t-imamichi commented Jun 17, 2025

I tried this pass and found that it emits RZ(0). It would be nice if it avoids RZ(0).

from qiskit import QuantumCircuit, __version__
from qiskit.providers.fake_provider import GenericBackendV2
from qiskit.transpiler import PassManager
from qiskit.transpiler.coupling import CouplingMap
from qiskit.transpiler.passes.optimization.two_qubit_peephole import TwoQubitPeepholeOptimization

print(f"qiskit_version={__version__}")
qc = QuantumCircuit(2)
qc.rzz(0.1, 0, 1)
cmap = CouplingMap.from_grid(2, 2)
backend = GenericBackendV2(cmap.size(), seed=2024, coupling_map=cmap)
peephole = PassManager([TwoQubitPeepholeOptimization(backend.target)])
t_qc = peephole.run(qc)
print(t_qc)

output

qiskit_version=2.2.0.dev0+b24088a
       ┌──────────┐ ┌───┐┌─────────┐               ┌───┐┌───┐ ┌─────────┐
q_0: ──┤ Rz(-π/2) ├─┤ X ├┤ Rz(0.1) ├───────────────┤ X ├┤ X ├─┤ Rz(π/2) ├──
     ┌─┴──────────┴┐└─┬─┘└──┬────┬─┘┌───────┐┌────┐└─┬─┘├───┤┌┴─────────┴─┐
q_1: ┤ Rz(-3.1374) ├──■─────┤ √X ├──┤ Rz(0) ├┤ √X ├──■──┤ X ├┤ Rz(3.1374) ├
     └─────────────┘        └────┘  └───────┘└────┘     └───┘└────────────┘

It would be nice if the outcome is as simple as follows.

qiskit_version=2.0.2
         ┌───┐┌─────────┐┌───┐
q_1 -> 2 ┤ X ├┤ Rz(0.1) ├┤ X ├
         └─┬─┘└─────────┘└─┬─┘
q_0 -> 3 ──■───────────────■──

@ShellyGarion
Copy link
Member

I see that this is the output of the TwoQubitBasisDecomposer:

decomp = TwoQubitBasisDecomposer(CXGate(), euler_basis='ZSXX')
print (decomp(Operator(qc)))

outputs:

     ┌────────────┐        ┌────┐  ┌───────┐┌────┐     ┌───┐┌─────────────┐
q_0: ┤ Rz(1.3593) ├──■─────┤ √X ├──┤ Rz(0) ├┤ √X ├──■──┤ X ├┤ Rz(-1.3593) ├
     └┬──────────┬┘┌─┴─┐┌──┴────┴─┐└───────┘└────┘┌─┴─┐├───┤└─┬─────────┬─┘
q_1: ─┤ Rz(-π/2) ├─┤ X ├┤ Rz(0.1) ├───────────────┤ X ├┤ X ├──┤ Rz(π/2) ├──
      └──────────┘ └───┘└─────────┘               └───┘└───┘  └─────────┘  

However, if we just use the BasisTranslator you get the correct decomposition:

basic = PassManager([BasisTranslator(sel, backend.target)])
t_qc = basic.run(qc)
print(t_qc)

outputs:

q_0: ──■───────────────■──
     ┌─┴─┐┌─────────┐┌─┴─┐
q_1: ┤ X ├┤ Rz(0.1) ├┤ X ├
     └───┘└─────────┘└───┘

perhaps it's worth to call the BasisTranslator before calling the decomposers? I think this can reduce the 1q gate counts.

@mtreinish
Copy link
Member Author

I tried this pass and found that it emits RZ(0). It would be nice if it avoids RZ(0).

I dug into this. It's an issue unrelated to this PR. There is a pulse optimal CX decomposer path in the TwoQubitBasisDecomposer classand in the code there was no checking of the angles being emitted. We had this logic in the euler 1q decomposer but not the pulse optimal path. I pushed this PR up #14629 to fix this

@t-imamichi
Copy link
Member

t-imamichi commented Jun 18, 2025

@mtreinish Thank you for removing RZ(0).
@ShellyGarion Yes, BasisTranslator does a good job for the case. However, I'm interested in #13431, and I checked whether the current PR can optimize the circuit or not.

mtreinish added 2 commits July 7, 2025 13:04
This commit adds a 3rd comparison to the heuristic used for comparing
the sequences. Previously it would first compare the 2q gate counts, and
pick the sequence with the fewest 2q gates, if the 2q gate counts are
the same then it would compute the expected error rate from the sequence and pick the sequence with the best expected fidelity. This commit updates this so if the expected fidelity are the same it will pick the sequence with the least number of gates.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: New Feature Include in the "Added" section of the changelog mod: transpiler Issues and PRs related to Transpiler performance Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add parallel synthesis interface to default unitary synthesis plugin ConsolidateBlocks does not have a good logic for heterogeneous gates
6 participants