Skip to content

Conversation

kevinhartman
Copy link
Contributor

@kevinhartman kevinhartman commented Nov 25, 2024

Summary

Adds a new Rust enum StandardInstruction representation for standard instructions (i.e. Barrier, Delay, Measure and Reset) and updates the encoding of PackedOperation to support storing it compactly at rest.

The size of a PackedOperation has now been anchored to always be 64 bits , even on 32-bit systems EDIT: we no longer support 32-bit anyway. This is necessary to encode the data payload of standard instructions like Barrier and Delay. See the docstrings within packed_instruction.rs for details.

Details and comments

The implementation works very similarly to what we currently do for StandardGate, but with a bit of extra consideration for an additional data payload used by variadic/template instructions like Barrier and Delay.

Similarly to StandardGate, the newly added StandardInstruction enum serves as the first-class Rust interface for working with standard instructions. The existing OperationRef enum returned by PackedOperation::view has a new variant StandardInstruction which exposes this.

Unlike StandardGate, StandardInstruction is not a pyclass because it is not directly used to tag instructions as standard in our Python class definitions. Instead, the simple integral enum StandardInstructionType is used as the tag, and OperationFromPython::extract_bound further queries the incoming Python object for variadic/template data when building the complete StandardInstruction.

Encoding

The PackedOperation encoding has been updated as follows:

  • A PackedOperation is now always 64 bits wide
  • The discriminant has been widened from 2 to 3 bits (it is now at its maximum width, but we still have room for 3 more variants).
  • The discriminant now has additional variant StandardInstruction.
  • A packed standard gate's enum is now byte-aligned.

@jakelishman
Copy link
Member

jakelishman commented Nov 25, 2024

I'm concerned that the LoHi struct and PackedOperation union are making it easier to make mistakes on BE and/or 32-bit systems, because the encodings now change between them.

I feel like the bitshift and masking operations could be extended over the whole u64, and that'll all automatically work on BE and 32-bit systems, especially with the size of PackedOperation now fixed to 64 bits - we can even have everything be aligned. The shifts of the payload of instructions can be set such that a u32 payload is always in the high bits, with the padding bits between it and the rest of the discriminants if you're concerned about aligned access, because then the bitshifts will be compiled out into a single register load (just read the u32 from the right place) and the compiler will handle endianness for us. The pointer doesn't need to be handled any differently to any other payload - everything's just shifts and masks anyway (LoHi can't avoid that), and I think introducing the union makes that harder to follow.

The LoHi struct to me seems to be forcing every method of PackedOperation to think about the partial register reads/loads. If we use shifts and masks on a u64 with const masks and shifts, there's no logic split where some of it is done with shifts and some with endianness-dependent partial register access in our own code.

@jakelishman
Copy link
Member

jakelishman commented Nov 25, 2024

I guess my main point is this: the 64-bit PackedOperation can always be thought about as a collection of payloads each of which is stored in a particular mask and needs a particular shift. These are:

  • the PackedOperation discriminant at (0b111, 0)
  • a pointer at (usize::MAX as u64 & 0b000, 0) (regardless of 32-bit or 64-bit)
  • a StandardGate discriminant at (0xff << 3, 3)
  • a StandardInstruction discriminant at (0xf << 3, 3)
  • a DelayUnit payload at (for example) (0x7 << 32, 32)
  • a u32 payload at (u32::MAX as u64 << 32, 32)

Introducing LoHi doesn't remove the need to bitshift and mask for most items, it just means that some of my above list are done one way, some are done another, and the union means that the pointer now has more ways to access it. LoHi also restricts the payload size to u32, when we already have payloads that exceed that (the pointers).

If you want a shade more encapsulation than my loose const shift and mask associated items, would something like this look better to you?

struct PayloadLocation<T: SomeBytemuckOrCustomTrait> {
    mask: u64,
    shift: u32,
    datatype: PhantomData<T>,
}
impl PayloadLocation {
    // Note we _mustn't_ accept `PackedOperation` until we've got a valid one,
    // because holding a `PackedOperation` implies ownership over its stored
    // pointer, and it mustn't `Drop` or attempt to interpret partial data.

    #[inline]
    fn store(&self, src: T, target: u64) -> u64 {
        let src: u64 = bytemuck::cast(src);
        target | ((src << self.shift) & self.mask)
    }
    #[inline]
    fn load(&self, target: u64) -> T {
        bytemuck::cast((target & self.mask) >> self.shift)
    }
}

const PACKED_OPERATION_DISCRIMINANT = PayloadLocation { mask: 0b111, shift: 0 };
const STANDARD_GATE_DISCRIMINANT = PayloadLocation { mask: 0bff << 3, shift: 3 };
const POINTER_PAYLOAD = PayloadLocation { mask: usize::MAX as u64 & 0b000, shift: 0 };
// and so on

(Bear in mind I just typed that raw without trying it, and I didn't think all that hard about what the trait bound should be.)

If everything's inlined and constant, the compiler absolutely should be able to compile out 32 bit shifts and all-ones masks into partial register reads/loads itself, so there oughtn't to be any overhead.

Trying out a neat crate for Rust bitfields. The caveats are:

* Custom enums used in the bitfield must specify const funcs for bit
  conversion, and bytemuck's cast functions aren't const.
* The bitfield crate implements Clone for you, but does not seem to have
  a way to disable this. We can't rely on their clone, since for pointer
  types we need to allocate a new Box. To get around this,
  PackedOperation is a wrapper around an internal bitfield struct
  (rather than being a bitfield struct itself).

(Note: I'm not yet happy with this. Specifically, I think the
abstraction may be cleaner if OpBitField is defined entirely in terms of
raw native types and any reinterpretation / transmutation is done from
PackedOperation. Consider this a first pass.)
@1ucian0 1ucian0 added the Rust This PR or issue is related to Rust code in the repository label Dec 11, 2024
Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this - this latest implementation looks very nice and clean to me. I have a couple of questions about the private modules inside packed_instruction, but they're relatively minor. In general for the implementation I'm happy with this and would like to get it merged sooner rather than later.

It would be really good to get a benchmark comparison for this before we merge, to check there's no expected problems.

Comment on lines +344 to +350
unsafe impl ::bytemuck::CheckedBitPattern for StandardInstructionType {
type Bits = u8;

fn is_valid_bit_pattern(bits: &Self::Bits) -> bool {
*bits < 4
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you've thought of a nicer way to do this? I know we've got a few instances of this kind of check in the library by now, and I'm getting more nervous about making mistakes in them, especially since we already did it once).

No need to change this PR about it (and if we do have any ideas, let's do it in a follow-up anyway), just asking if anything's occurred to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing that has loosely occurred to me here (other than using nightly features) is to define these enum types with some kind of recursive macro that generates each variant's index and implements CheckedBitPattern after visiting all members. It feels like that would be possible, but I'm not well-versed in Rust macros yet, so I could be wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah no worries - I also had no better ideas, I'm just really hoping someone comes up with something workable haha.

We could write a proc macro to do it, but I don't know what effect that'll have on our build times. I suppose we're already in that boat via PyO3?

Comment on lines +34 to +39
def __init_subclass__(cls, **kwargs):
super().__init_subclass__(**kwargs)
# Subclasses of Measure are not "standard", so we set this to None to
# prevent the Rust code from treating them as such.
cls._standard_instruction_type = None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're going to have to revisit this in the future to more properly support specialised subclasses of standard instructions - we're almost certain to miss things in the compiler if we're encoding two different things that are actually measures in two very different forms (StandardInstruction and Instruction). That said, we already have that problem with the subclassing, so I don't think this isn't going to cause any new problems.

I mostly just mean that I think the need to do this is indicative that we don't have the story properly formalised yet, which was true before anyway.

Comment on lines +133 to +134
/// A private module to encapsulate the encoding of [StandardGate].
mod standard_gate {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor question, but what encapsulation do you mean here? The trait implementations aren't private, because visibility is tied to the trait and the types involved (and everything here is public), and the structs aren't public because they're not exported anyway.

I'm fine if it's just for logical grouping within the file, I'm just confused by the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove the comment since it's likely self-documenting anyway and therefore confusing to readers. I really only meant that StandardInstructionBits and our reliance on bitfield_struct is private to the inline standard_gate module. I was trying to emphasize this to show that the new implementation shouldn't lock us in to using an external crate, if it becomes inconvenient.

@jakelishman
Copy link
Member

Ah, I see that #13759 is actually built on top of this (I hadn't looked at it properly yet), so my concerns about the Drop impl aren't really realised in the same way.

Putting the Drop impl inside mod pointer {} still feels wrong to me, and I'd rather we didn't do it, but given that the way UnitaryGate is implemented in Rust doesn't actually have a problem with it, it's not as critical to change.

Also uses codegen for From<Box<T>> explicitly for all T,
rather than a blanket impl on the private PackablePointer
trait to ensure proper docs are generated.
@kevinhartman
Copy link
Contributor Author

Thanks for the review!

Putting the Drop impl inside mod pointer {} still feels wrong to me, and I'd rather we didn't do it, but given that the way UnitaryGate is implemented in Rust doesn't actually have a problem with it, it's not as critical to change.

As mentioned in a few of my recent comment replies, this should be resolved now 🙂.

Separately, I'll run some benchmarks for this against main, and then against a copy that reverts the layout change to a packed StandardGate. I sort of doubt the compiled output will be any different, tbh, but you're right we should at least make sure it isn't worse.

Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice nice nice, yeah, that last change to the Drop is nice - pointer-y stuff can stay in the pointer code group, but Drop itself isn't hidden.

Thanks for all this - I think it's good just to verify with the benchmarks, but assuming no regressions, I'm happy to merge this now. I'll tick once we've got the benchmarks.

@kevinhartman
Copy link
Contributor Author

There were a few benchmarks that got worse, so I switched us back to using the same padding for StandardGate for the sake of isolation, even though I don't believe the pad difference is responsible. These benchmarks compare the merge-base of main and this branch, before I added a few more #[inline] directives (these ended up doing nothing for the smaller sample of tests I ran locally, and I'd rather not run all the circuit construction and pass tests again).

Here's what improved:

Change Before [8095ace] <standard-op^2> After [c85be7a] <standard-op~1> Ratio Benchmark (Parameter)
- 44.6±3μs 39.0±1μs 0.87 circuit_construction.CircuitConstructionBench.time_circuit_extend(2, 128)
- 56.1±5μs 45.2±2μs 0.81 circuit_construction.CircuitConstructionBench.time_circuit_extend(8, 128)
- 40.3±1ms 35.9±0.7ms 0.89 passes.MultiQBlockPassBenchmarks.time_collect_multiq_block(5, 1024, 1)
- 736±20ms 662±7ms 0.9 passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(14, 1024, ['u', 'cx', 'id'])
- 37.5±0.7μs 34.0±0.7μs 0.91 passes.PassBenchmarks.time_remove_diagonal_gates_before_measurement(5, 1024)

And here's what got worse:

Change Before [8095ace] <standard-op^2> After [c85be7a] <standard-op~1> Ratio Benchmark (Parameter)
+ 40.7±2ms 47.2±2ms 1.16 mapping_passes.PassBenchmarks.time_apply_layout(14, 1024)
+ 52.1±2μs 61.7±1μs 1.18 passes.PassBenchmarks.time_remove_barriers(14, 1024)
+ 64.1±0.6μs 75.1±3μs 1.17 passes.PassBenchmarks.time_remove_barriers(20, 1024)
+ 32.7±0.2μs 36.8±0.4μs 1.12 passes.PassBenchmarks.time_remove_barriers(5, 1024)
+ 1.55±0.03ms 1.80±0.05ms 1.16 passes.PassBenchmarks.time_remove_reset_in_zero_state(14, 1024)
+ 634±10μs 713±70μs 1.12 passes.PassBenchmarks.time_remove_reset_in_zero_state(5, 1024)
+ 420±20μs 531±7μs 1.27 passes.PassBenchmarks.time_unroll_3q_or_more(5, 1024)

I believe the passes that have gotten worse are all Python-based and do things like iterate over the DAG (which requires Python Barrier, Delay, Reset, and Measure to be re-inflated). So I don't think there's anything to be concerned about here, though I'm happy to dig further if anyone feels differently.

@jakelishman
Copy link
Member

Yeah I'd agree - all those passes that got worse look to be because we've just shifted the Python-space construction cost into the passes themselves. The total time is probably the same, just some cost moved from setup to benchmark. I also highly doubt it's anything to do with StandardGate changes, but since it's not proven safe, it's good to move it, and we can just do that move in a separate PR if you still want it, where it'd be easier to revert if the overnight benchmarking shows a regression.

Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on this!

@jakelishman jakelishman enabled auto-merge February 5, 2025 16:32
@jakelishman jakelishman added this pull request to the merge queue Feb 5, 2025
Merged via the queue into Qiskit:main with commit 73409fb Feb 5, 2025
17 checks passed
wshanks added a commit to wshanks/qiskit-experiments that referenced this pull request Feb 8, 2025
A long-standing bug in qpy leads to the units of delay instructions
being overwritten to `dt`. See:

Qiskit/qiskit#10662

Some experiments test that their circuits are round-trip serializable
through qpy (perhaps not as necessary as it once was as the room for
sticking arbitrary Python into circuits is reduced with the move to
Rust). The T1 and Tphi circuits have delays and so were affected by the
qpy bug. However, there is a second bug related to delays which canceled
this bug out for those tests. `Delay.__eq__` does not check that the
units are the same for the two delay instructions, so the loss of the
original unit did not matter for the tests. Following this change:

Qiskit/qiskit#13486

Comparing circuits with delays now includes the delay units in the
determination of equality. `Delay.__eq__` itself still does not consider
the unit but some part of adding the delay to a circuit converts it into
a Rust object that does check the unit.

Here the tests are given a backend with a `dt` value so that the
experiments will generate circuits with delays in `dt` and so still have
delays with the same `dt` after the qpy roundtrip.
github-merge-queue bot pushed a commit to qiskit-community/qiskit-experiments that referenced this pull request Feb 11, 2025
…#1509)

A long-standing bug in qpy leads to the units of delay instructions
being overwritten to `dt`. See:

Qiskit/qiskit#10662

Some experiments test that their circuits are round-trip serializable
through qpy (perhaps not as necessary as it once was as the room for
sticking arbitrary Python into circuits is reduced with the move to
Rust). The T1 and Tphi circuits have delays and so were affected by the
qpy bug. However, there is a second bug related to delays which canceled
this bug out for those tests. `Delay.__eq__` does not check that the
units are the same for the two delay instructions, so the loss of the
original unit did not matter for the tests. Following this change:

Qiskit/qiskit#13486

Comparing circuits with delays now includes the delay units in the
determination of equality. `Delay.__eq__` itself still does not consider
the unit but some part of adding the delay to a circuit converts it into
a Rust object that does check the unit.

Here the tests are given a backend with a `dt` value so that the
experiments will generate circuits with delays in `dt` and so still have
delays with the same `dt` after the qpy roundtrip.
wshanks added a commit to wshanks/qiskit-experiments that referenced this pull request Feb 17, 2025
…qiskit-community#1509)

A long-standing bug in qpy leads to the units of delay instructions
being overwritten to `dt`. See:

Qiskit/qiskit#10662

Some experiments test that their circuits are round-trip serializable
through qpy (perhaps not as necessary as it once was as the room for
sticking arbitrary Python into circuits is reduced with the move to
Rust). The T1 and Tphi circuits have delays and so were affected by the
qpy bug. However, there is a second bug related to delays which canceled
this bug out for those tests. `Delay.__eq__` does not check that the
units are the same for the two delay instructions, so the loss of the
original unit did not matter for the tests. Following this change:

Qiskit/qiskit#13486

Comparing circuits with delays now includes the delay units in the
determination of equality. `Delay.__eq__` itself still does not consider
the unit but some part of adding the delay to a circuit converts it into
a Rust object that does check the unit.

Here the tests are given a backend with a `dt` value so that the
experiments will generate circuits with delays in `dt` and so still have
delays with the same `dt` after the qpy roundtrip.
mergify bot pushed a commit to qiskit-community/qiskit-experiments that referenced this pull request Feb 25, 2025
…#1509)

A long-standing bug in qpy leads to the units of delay instructions
being overwritten to `dt`. See:

Qiskit/qiskit#10662

Some experiments test that their circuits are round-trip serializable
through qpy (perhaps not as necessary as it once was as the room for
sticking arbitrary Python into circuits is reduced with the move to
Rust). The T1 and Tphi circuits have delays and so were affected by the
qpy bug. However, there is a second bug related to delays which canceled
this bug out for those tests. `Delay.__eq__` does not check that the
units are the same for the two delay instructions, so the loss of the
original unit did not matter for the tests. Following this change:

Qiskit/qiskit#13486

Comparing circuits with delays now includes the delay units in the
determination of equality. `Delay.__eq__` itself still does not consider
the unit but some part of adding the delay to a circuit converts it into
a Rust object that does check the unit.

Here the tests are given a backend with a `dt` value so that the
experiments will generate circuits with delays in `dt` and so still have
delays with the same `dt` after the qpy roundtrip.

(cherry picked from commit ad2341d)
mergify bot pushed a commit to qiskit-community/qiskit-experiments that referenced this pull request Feb 25, 2025
…#1509)

A long-standing bug in qpy leads to the units of delay instructions
being overwritten to `dt`. See:

Qiskit/qiskit#10662

Some experiments test that their circuits are round-trip serializable
through qpy (perhaps not as necessary as it once was as the room for
sticking arbitrary Python into circuits is reduced with the move to
Rust). The T1 and Tphi circuits have delays and so were affected by the
qpy bug. However, there is a second bug related to delays which canceled
this bug out for those tests. `Delay.__eq__` does not check that the
units are the same for the two delay instructions, so the loss of the
original unit did not matter for the tests. Following this change:

Qiskit/qiskit#13486

Comparing circuits with delays now includes the delay units in the
determination of equality. `Delay.__eq__` itself still does not consider
the unit but some part of adding the delay to a circuit converts it into
a Rust object that does check the unit.

Here the tests are given a backend with a `dt` value so that the
experiments will generate circuits with delays in `dt` and so still have
delays with the same `dt` after the qpy roundtrip.

(cherry picked from commit ad2341d)
wshanks added a commit to qiskit-community/qiskit-experiments that referenced this pull request Feb 25, 2025
… (backport #1509) (#1520)

A long-standing bug in qpy leads to the units of delay instructions
being overwritten to `dt`. See:

Qiskit/qiskit#10662

Some experiments test that their circuits are round-trip serializable
through qpy (perhaps not as necessary as it once was as the room for
sticking arbitrary Python into circuits is reduced with the move to
Rust). The T1 and Tphi circuits have delays and so were affected by the
qpy bug. However, there is a second bug related to delays which canceled
this bug out for those tests. `Delay.__eq__` does not check that the
units are the same for the two delay instructions, so the loss of the
original unit did not matter for the tests. Following this change:

Qiskit/qiskit#13486

Comparing circuits with delays now includes the delay units in the
determination of equality. `Delay.__eq__` itself still does not consider
the unit but some part of adding the delay to a circuit converts it into
a Rust object that does check the unit.

Here the tests are given a backend with a `dt` value so that the
experiments will generate circuits with delays in `dt` and so still have
delays with the same `dt` after the qpy roundtrip.<hr>This is an
automatic backport of pull request #1509 done by
[Mergify](https://mergify.com).

Co-authored-by: Will Shanks <willshanks@us.ibm.com>
@ElePT ElePT added Changelog: New Feature Include in the "Added" section of the changelog Changelog: None Do not include in changelog and removed Changelog: New Feature Include in the "Added" section of the changelog labels Mar 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: None Do not include in changelog performance priority: high Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add rust representation for non-gate standard circuit operations
7 participants