Skip to content

Conversation

JAicewizard
Copy link
Contributor

When the rhs of an integer divide is known to be constant, it is possible to optimize this. Libdivide does this optimization at runtime, so even when the constant isn't known at compile time, it can still perform similar optimizations.

As you can see below the code using libdivide runs up to twice as fast. It should be even faster for unsigned integers.
It might be possible for the compiler to auto-vectorise the branchless version of libdivide, but I have not looked into that. Open for future research.

Issues with this code

I personally find the current method a bit messy, as it is the responsibility of the OPWRAPPER to decide what to optimize and what not. On one hand this allows the OP to be as generic as it was before. On the other hand, adding this libdivide optimization to more types will result in lots of duplicate code across the wrappers.

Benchmarks (ran with turbo-boost disabled)

No optimisation:

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	1.170270
benchmark/micro/arithmetic/division_constrhs.benchmark	2	1.173304
benchmark/micro/arithmetic/division_constrhs.benchmark	3	1.164200
benchmark/micro/arithmetic/division_constrhs.benchmark	4	1.179625
benchmark/micro/arithmetic/division_constrhs.benchmark	5	1.177252

Just performing row validity check based on the rhs (if possible):

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	1.113453
benchmark/micro/arithmetic/division_constrhs.benchmark	2	1.130460
benchmark/micro/arithmetic/division_constrhs.benchmark	3	1.125331
benchmark/micro/arithmetic/division_constrhs.benchmark	4	1.134362
benchmark/micro/arithmetic/division_constrhs.benchmark	5	1.120623

Using libdivide:

name	run	timing
benchmark/micro/arithmetic/division_constrhs.benchmark	1	0.498989
benchmark/micro/arithmetic/division_constrhs.benchmark	2	0.497163
benchmark/micro/arithmetic/division_constrhs.benchmark	3	0.498707
benchmark/micro/arithmetic/division_constrhs.benchmark	4	0.504418
benchmark/micro/arithmetic/division_constrhs.benchmark	5	0.502800

@JAicewizard
Copy link
Contributor Author

Builds are all failing seemingly because of formatting issues.
As mentioned the biggest issue I have with the code is that the responsibility of optimization is with the wrapper not the operation itself. As I am not very familiar with c++ templates, I don't really know how to improve this.
However if this isn't an issue on your side, I can remove the WIP and fix the formatting issues.

Similar optimisations can be done for the modulo operator. I will file a seperate PR for that once this is merged.

Tagging @lnkuiper since I already mentioned this to him on monday

@github-actions github-actions bot marked this pull request as draft January 26, 2024 11:43
@JAicewizard JAicewizard marked this pull request as ready for review January 26, 2024 11:56
@JAicewizard JAicewizard changed the title WIP: Optimise division by a constant at runtime for integer division Optimise division by a constant at runtime for integer division Jan 26, 2024
@github-actions github-actions bot marked this pull request as draft January 26, 2024 12:03
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Great performance results.

  • Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?
  • We avoid explicit SIMD instructions in DuckDB - and libdivide seems to contain many of them. In general it seems like a very complex library for what should be a relatively simple optimization. I wonder if we could either strip down libdivide, or switch to something like fastmod which seems to be a lot more simple.

@JAicewizard
Copy link
Contributor Author

Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?

I don't know! I don't know a lot of the internals of duckdb. I can write a function that does this, but I am not sure I can also do the optimization part of it. This does however seam like a much cleaner solution.

We avoid explicit SIMD instructions in DuckDB - and libdivide seems to contain many of them. In general it seems like a very complex library for what should be a relatively simple optimization. I wonder if we could either strip down libdivide, or switch to something like fastmod which seems to be a lot more simple.

I havn't yet tested fastmod, that was the library I was intending to use for the follow-up using modulo. One advantage of this library is that it also provides branchless versions, which may prove advantageous for automatic vectorisation. I also don't know the performance of fastmod for division
All the explicit vectorisation can be removed I think, I will look into that.

@Mytherin
Copy link
Collaborator

Can we create a separate divide_by_const function, and then rewrite x // C to divide_by_const(x, C) as an optimization in the ArithmeticSimplificationRule, instead of doing this optimization within the division operator?

I don't know! I don't know a lot of the internals of duckdb. I can write a function that does this, but I am not sure I can also do the optimization part of it. This does however seam like a much cleaner solution.

Have a look here, I think the rewrite should be relatively straightforward.

@JAicewizard
Copy link
Contributor Author

I implemented it as a function instead of an operator. I wasn't entirely sure where to put it, but it can easily be moved of course.

I also looked into using fastmod for division, however it only supports 32 and 64 bit integers, and doesnt support 64 bit division on MSVC at all. I also measured the performance, and it is significantly slower.

@JAicewizard JAicewizard force-pushed the optimise_intdiv branch 2 times, most recently from 355b92e to fb06827 Compare January 28, 2024 13:11
@JAicewizard JAicewizard marked this pull request as ready for review January 28, 2024 13:12
@github-actions github-actions bot marked this pull request as draft January 28, 2024 21:08
@JAicewizard JAicewizard marked this pull request as ready for review January 28, 2024 21:08
@github-actions github-actions bot marked this pull request as draft January 28, 2024 21:15
@JAicewizard JAicewizard marked this pull request as ready for review January 28, 2024 21:16
@github-actions github-actions bot marked this pull request as draft January 28, 2024 21:21
@JAicewizard JAicewizard marked this pull request as ready for review January 28, 2024 21:22
@github-actions github-actions bot marked this pull request as draft January 31, 2024 14:09
@JAicewizard
Copy link
Contributor Author

I moved this PR to use fastmod. This reduced the available types this optimization applies to, to uint32_t, but performance is around the same.

@JAicewizard
Copy link
Contributor Author

I reimplemented fastmod using the duckdb 128 bit types to allow unsigned as well as signed 32 bit execution. It is even possible to implement the unsigned 64 bit variant using this.

To get the performance somewhat good I needed the fast-path of the hugeint multiply to be in the header, as otherwise the compiler cant optimize this. I left the slow path in the c++ file, but in case of a modern gcc or clang compiler, calling the multiply function will use the optimized bath and be inlined. This makes the multiply a lot faster (if used directly, not via *).

If this looks good I can easily implement this for uint64 and (u)int{8/16}

@JAicewizard JAicewizard requested a review from Mytherin February 14, 2024 12:31
@JAicewizard JAicewizard marked this pull request as ready for review February 14, 2024 12:32
@github-actions github-actions bot marked this pull request as draft February 14, 2024 12:41
@JAicewizard JAicewizard marked this pull request as ready for review February 14, 2024 12:55
@duckdb-draftbot duckdb-draftbot marked this pull request as draft August 23, 2024 18:18
@JAicewizard JAicewizard marked this pull request as ready for review August 23, 2024 18:24
@duckdb-draftbot duckdb-draftbot marked this pull request as draft August 23, 2024 18:36
@JAicewizard JAicewizard marked this pull request as ready for review August 23, 2024 18:36
@duckdb-draftbot duckdb-draftbot marked this pull request as draft August 23, 2024 18:42
@JAicewizard JAicewizard marked this pull request as ready for review August 23, 2024 18:42
@duckdb-draftbot duckdb-draftbot marked this pull request as draft August 23, 2024 18:50
@JAicewizard JAicewizard marked this pull request as ready for review August 23, 2024 18:51
@duckdb-draftbot duckdb-draftbot marked this pull request as draft August 23, 2024 18:55
@JAicewizard JAicewizard marked this pull request as ready for review August 23, 2024 18:55
@duckdb-draftbot duckdb-draftbot marked this pull request as draft August 23, 2024 19:12
@JAicewizard JAicewizard marked this pull request as ready for review August 23, 2024 19:20
@JAicewizard
Copy link
Contributor Author

@Mytherin @lnkuiper It's been a while, but I re-based and all the test now pass (not sure why they failed before). Are there any more changes I should make?

@Mytherin Mytherin changed the base branch from main to feature August 26, 2024 07:30
@Mytherin
Copy link
Collaborator

Thanks for the changes! I think this is actually ready to merge, @lnkuiper thoughts?

Copy link
Contributor

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mytherin I agree, this looks good!

@Mytherin Mytherin merged commit c447522 into duckdb:feature Aug 26, 2024
43 checks passed
@JAicewizard JAicewizard deleted the optimise_intdiv branch August 26, 2024 12:21
@JAicewizard
Copy link
Contributor Author

Thanks for the merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants