Skip to content

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented Jun 2, 2025

What this PR does

This PR implements an optimisation inspired by prometheus/prometheus#14097.

I've implemented this as a MQE optimisation pass (rather than reusing the logic from that Prometheus PR) for two reasons:

  • It helps prove out the structure of optimisation passes, and demonstrate they're applicable to more than just common subexpression elimination.
  • We need to consider how this optimisation interacts with common subexpression elimination. Specifically, we need to consider all the different paths to a selector after common subexpression elimination has been performed to know if we can skip decoding histogram buckets. Alternatively, if we were to reuse the logic from Prometheus, the common subexpression elimination pass would need to know how to merge selectors that are identical save for the SkipHistogramBuckets hint.

Peak memory consumption is reduced by ~10% in our benchmarks. Latency improves by up to 25% in some cases, but regresses by a small amount (in both absolute and relative terms) for some instant queries:

goos: darwin
goarch: arm64
pkg: github.com/grafana/mimir/pkg/streamingpromql/benchmarks
cpu: Apple M1 Pro
                                                                                                            │ original.txt │             final.txt              │
                                                                                                            │    sec/op    │   sec/op     vs base               │
Query/histogram_count(sum(rate(nh_1[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                      244.8µ ± 4%   236.4µ ± 1%   -3.40% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10         291.7µ ± 4%   261.4µ ± 2%  -10.40% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10        791.2µ ± 1%   627.4µ ± 2%  -20.70% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                    7.719m ± 1%   7.948m ± 0%   +2.97% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10       11.85m ± 1%   10.64m ± 1%  -10.25% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10      54.10m ± 3%   41.61m ± 2%  -23.08% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                   146.2m ± 5%   150.3m ± 0%        ~ (p=0.065 n=6)
Query/histogram_count(sum(rate(nh_2000[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10      223.9m ± 3%   202.3m ± 0%   -9.63% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10    1047.5m ± 3%   789.9m ± 1%  -24.60% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                     264.6µ ± 4%   243.7µ ± 1%   -7.92% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10        359.7µ ± 4%   320.3µ ± 2%  -10.96% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10       1.295m ± 3%   1.089m ± 0%  -15.92% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                   8.048m ± 1%   8.472m ± 1%   +5.26% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10      16.53m ± 2%   15.48m ± 4%   -6.34% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10    100.81m ± 1%   84.79m ± 0%  -15.89% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                  151.2m ± 0%   160.0m ± 0%   +5.83% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10     314.8m ± 0%   295.9m ± 1%   -6.00% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10     1.968 ± 1%    1.660 ± 0%  -15.64% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                        244.1µ ± 2%   230.4µ ± 2%   -5.60% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10           289.1µ ± 2%   261.0µ ± 1%   -9.70% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10          783.2µ ± 1%   623.1µ ± 0%  -20.44% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                      7.820m ± 2%   7.937m ± 1%   +1.49% (p=0.041 n=6)
Query/histogram_sum(sum(rate(nh_100[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10         11.95m ± 1%   10.57m ± 1%  -11.52% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10        54.67m ± 1%   41.65m ± 0%  -23.83% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                     146.8m ± 1%   150.1m ± 0%   +2.27% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10        224.9m ± 0%   200.4m ± 1%  -10.90% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10      1050.1m ± 1%   790.5m ± 1%  -24.72% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                       258.7µ ± 2%   243.4µ ± 2%   -5.91% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10          353.7µ ± 3%   317.4µ ± 2%  -10.27% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10         1.290m ± 1%   1.084m ± 1%  -15.93% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                     8.085m ± 0%   8.471m ± 1%   +4.78% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10        16.58m ± 6%   15.49m ± 1%   -6.59% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10      101.78m ± 3%   84.98m ± 2%  -16.50% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                    152.6m ± 2%   160.2m ± 2%   +4.99% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10       317.7m ± 1%   296.4m ± 1%   -6.70% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10       1.980 ± 1%    1.654 ± 0%  -16.49% (p=0.002 n=6)
geomean                                                                                                        15.01m        13.55m        -9.76%

                                                                                                            │ original.txt │              final.txt              │
                                                                                                            │      B       │      B        vs base               │
Query/histogram_count(sum(rate(nh_1[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                     76.69Mi ± 5%   68.97Mi ± 5%  -10.07% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10        74.16Mi ± 2%   66.49Mi ± 1%  -10.34% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10       73.48Mi ± 1%   64.20Mi ± 2%  -12.63% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                   71.55Mi ± 1%   62.59Mi ± 3%  -12.52% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10      69.62Mi ± 1%   62.04Mi ± 2%  -10.89% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10     71.49Mi ± 1%   63.12Mi ± 1%  -11.71% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                  75.71Mi ± 2%   67.80Mi ± 2%  -10.44% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10     75.67Mi ± 2%   67.84Mi ± 2%  -10.36% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10    79.64Mi ± 1%   72.62Mi ± 1%   -8.81% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                    75.88Mi ± 2%   67.17Mi ± 2%  -11.48% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10       73.75Mi ± 2%   66.02Mi ± 4%  -10.49% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_1[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10      73.00Mi ± 1%   64.12Mi ± 2%  -12.16% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                  70.35Mi ± 2%   63.04Mi ± 2%  -10.39% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10     69.81Mi ± 1%   61.73Mi ± 1%  -11.57% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_100[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10    71.08Mi ± 1%   62.70Mi ± 1%  -11.78% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                 74.77Mi ± 3%   67.80Mi ± 3%   -9.32% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10    75.47Mi ± 1%   66.90Mi ± 2%  -11.36% (p=0.002 n=6)
Query/histogram_count(sum(rate(nh_2000[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10   79.35Mi ± 1%   70.52Mi ± 2%  -11.14% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                       76.41Mi ± 2%   69.45Mi ± 4%   -9.10% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10          75.35Mi ± 3%   66.16Mi ± 2%  -12.19% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10         72.92Mi ± 1%   63.72Mi ± 1%  -12.62% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                     71.88Mi ± 2%   62.25Mi ± 1%  -13.40% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10        70.02Mi ± 1%   61.74Mi ± 1%  -11.83% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10       71.70Mi ± 1%   63.26Mi ± 1%  -11.78% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[2m]))),_instant_query/engine=MimirWithQueryPlanner-10                    74.36Mi ± 2%   66.23Mi ± 2%  -10.94% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[2m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10       75.64Mi ± 1%   67.31Mi ± 1%  -11.01% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[2m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10      79.44Mi ± 1%   72.38Mi ± 1%   -8.89% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                      75.46Mi ± 4%   67.14Mi ± 2%  -11.03% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10         73.41Mi ± 3%   65.93Mi ± 3%  -10.19% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_1[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10        72.95Mi ± 1%   63.78Mi ± 1%  -12.57% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                    71.73Mi ± 2%   63.17Mi ± 2%  -11.93% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10       69.86Mi ± 1%   61.66Mi ± 1%  -11.73% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_100[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10      71.34Mi ± 1%   62.35Mi ± 2%  -12.60% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[20m]))),_instant_query/engine=MimirWithQueryPlanner-10                   75.13Mi ± 1%   67.05Mi ± 3%  -10.75% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[20m]))),_range_query_with_100_steps/engine=MimirWithQueryPlanner-10      75.03Mi ± 1%   67.27Mi ± 2%  -10.35% (p=0.002 n=6)
Query/histogram_sum(sum(rate(nh_2000[20m]))),_range_query_with_1000_steps/engine=MimirWithQueryPlanner-10     78.70Mi ± 2%   70.84Mi ± 1%   -9.98% (p=0.002 n=6)
geomean                                                                                                       73.92Mi        65.69Mi       -11.13%

⚠️ This PR is pending prometheus/prometheus#16686 being merged and vendored into Mimir.

Which issue(s) this PR fixes or relates to

#10067

Checklist

  • Tests updated.
  • [n/a] Documentation added.
  • [covered by Mimir Query Engine #10067] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [n/a] about-versioning.md updated with experimental features.

Copy link
Contributor

github-actions bot commented Jun 2, 2025

💻 Deploy preview deleted.

@charleskorn
Copy link
Contributor Author

This PR is pending prometheus/prometheus#16686 being merged and vendored into Mimir.

@charleskorn charleskorn force-pushed the charleskorn/mqe-skip-histogram-decoding branch from 0369ebb to 983a970 Compare June 26, 2025 07:11
@charleskorn charleskorn added the changelog-not-needed PRs that don't need a CHANGELOG.md entry label Jun 26, 2025
@charleskorn
Copy link
Contributor Author

This PR is pending prometheus/prometheus#16686 being merged and vendored into Mimir.

The Prometheus PR has been merged into Prometheus and vendored into Mimir, so this PR is ready for review.

@charleskorn charleskorn marked this pull request as ready for review June 27, 2025 00:56
@charleskorn charleskorn requested review from tacole02 and a team as code owners June 27, 2025 00:56
Copy link
Contributor

@56quarters 56quarters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this LGTM. I'm curious how much future optimization passes are going to resemble this one - it seems like this touches quite a few parts of the existing query planner. Is that to be expected?

return
}

if dup, ok := node.(*commonsubexpressionelimination.Duplicate); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this imply that future optimization passes will also need to be done after CSE and will need to be aware of previously applied optimizations?

Copy link
Contributor Author

@charleskorn charleskorn Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's going to depend on the optimisation. I'm really hoping that we can avoid every optimisation knowing about every other optimisation before it. If CSE becomes a common source of problems like here, then I'm thinking we could reverse the order, and move all the special logic into the CSE optimisation pass - this would increase the complexity of the CSE pass, but make all the others simpler, and keep everything CSE-related in one place.

}

func (o *EngineOpts) RegisterFlags(f *flag.FlagSet) {
f.BoolVar(&o.EnableCommonSubexpressionElimination, "querier.mimir-query-engine.enable-common-subexpression-elimination", true, "Enable common subexpression elimination when evaluating queries.")
f.BoolVar(&o.EnableSkippingHistogramDecoding, "querier.mimir-query-engine.enable-skipping-histogram-decoding", true, "Enable skipping histogram decoding when evaluating queries. Only applies if query planner is enabled.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this help text be expanded to qualify when and why skipping of decoding is done? Without looking at the code I wouldn't know that we skip decoding histograms as part of a query because it's not needed in some cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 9d3abf0

@charleskorn
Copy link
Contributor Author

Given the approval, I'm going to merge this, but if you have any further feedback, please let me know.

@charleskorn charleskorn merged commit 854a123 into main Jul 2, 2025
33 checks passed
@charleskorn charleskorn deleted the charleskorn/mqe-skip-histogram-decoding branch July 2, 2025 02:31
@charleskorn charleskorn mentioned this pull request Jul 2, 2025
1 task
charleskorn added a commit that referenced this pull request Jul 2, 2025
#### What this PR does

This PR fixes a broken test introduced by #11587 that did not take into
account the changes from #11900.

#### Which issue(s) this PR fixes or relates to

#11587

#### Checklist

- [x] Tests updated.
- [n/a] Documentation added.
- [n/a] `CHANGELOG.md` updated - the order of entries should be
`[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`. If changelog entry
is not needed, please add the `changelog-not-needed` label to the PR.
- [n/a]
[`about-versioning.md`](https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/about-versioning.md)
updated with experimental features.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog-not-needed PRs that don't need a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants