
enforce max series for metrics queries #4525


Merged 35 commits into grafana:main on Apr 11, 2025

Conversation

@ie-pham ie-pham commented Jan 7, 2025

What this PR does: Adds a config option to enforce the maximum number of time series returned by a metrics query. The limit is enforced at four levels: the front-end combiner, the querier combiner, metrics-generator local blocks, and metrics evaluation. It is set in the query-frontend config and is passed to all levels as maxSeries in the QueryRangeRequest proto.

New config: max_response_series (default: 1000)
Setting the value to 0 disables this feature.

Approach: keep track of the number of series by calling Length() until the limit is reached. The series generated up to that point are truncated at the limit and returned as partial results, which means the partial response may contain inaccurate values.
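
For illustration, a minimal sketch of the truncation idea, with hypothetical names (series, evaluator, and evaluate below are stand-ins, not the PR's actual types; in the PR the limit travels as maxSeries on the QueryRangeRequest proto):

package main

import "fmt"

// series is a stand-in for one time series in a metrics response.
type series struct {
	labels string
	values []float64
}

// evaluator is a stand-in for a combiner/evaluator that can report how many
// distinct series it has observed so far via Length().
type evaluator struct {
	series map[string]*series
}

func (e *evaluator) Length() int { return len(e.series) }

func (e *evaluator) observe(s *series) { e.series[s.labels] = s }

// evaluate stops consuming input once Length() reaches maxSeries and reports
// whether the response was truncated, i.e. is a partial result.
func evaluate(in []*series, maxSeries int) (out []*series, truncated bool) {
	e := &evaluator{series: map[string]*series{}}
	for _, s := range in {
		if maxSeries > 0 && e.Length() >= maxSeries {
			truncated = true
			break
		}
		e.observe(s)
	}
	for _, s := range e.series {
		out = append(out, s)
	}
	return out, truncated
}

func main() {
	in := []*series{{labels: "a"}, {labels: "b"}, {labels: "c"}}
	out, truncated := evaluate(in, 2)
	fmt.Println(len(out), truncated) // prints: 2 true
}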

Which issue(s) this PR fixes:
Fixes #4219

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • Once merged, please let @grafana/observability-traces-and-profiling know that Tempo: Support partial metric results grafana#103595 can be worked on
Benchmarks
goos: darwin
goarch: amd64
pkg: github.com/grafana/tempo/tempodb/encoding/vparquet4
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
                                                                                                       │  before.txt  │            after1000.txt            │
                                                                                                       │    sec/op    │    sec/op     vs base               │
BackendBlockQueryRange/{}_|_rate()/5                                                                     201.5m ± 11%   205.9m ± 16%       ~ (p=0.971 n=10)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code)/5                                          443.8m ±  2%   446.0m ±  3%       ~ (p=0.739 n=10)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5                                          251.8m ±  1%   246.8m ±  6%       ~ (p=0.123 n=10)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5                                                  433.3m ±  2%   440.9m ±  8%       ~ (p=0.529 n=10)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5                                3.721m ±  3%   3.710m ±  4%       ~ (p=0.739 n=10)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5   136.1m ±  6%   146.8m ±  9%       ~ (p=0.247 n=10)
BackendBlockQueryRange/{status=error}_|_rate()/5                                                         35.54m ±  9%   33.54m ±  3%       ~ (p=0.063 n=10)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5)/5                                  357.0m ±  4%   364.6m ±  3%       ~ (p=0.190 n=10)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code)/5               596.0m ±  4%   600.7m ±  8%       ~ (p=0.631 n=10)
BackendBlockQueryRange/{}_|_histogram_over_time(duration)/5                                              361.7m ± 13%   365.7m ±  3%       ~ (p=0.912 n=10)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code)/5                         528.3m ±  2%   519.9m ±  1%       ~ (p=0.105 n=10)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code)/5                         524.4m ±  5%   519.4m ±  3%       ~ (p=0.247 n=10)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code)/5                         528.1m ±  3%   529.5m ±  3%       ~ (p=0.912 n=10)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error})/5                                        9.137 ±  2%    8.954 ±  3%       ~ (p=0.075 n=10)
geomean                                                                                                  279.8m         280.5m        +0.25%

                                                                                                       │ before.txt  │            after1000.txt             │
                                                                                                       │  MB_IO/op   │  MB_IO/op    vs base                 │
BackendBlockQueryRange/{}_|_rate()/5                                                                      3.733 ± 0%    3.733 ± 0%       ~ (p=1.000 n=10)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code)/5                                           3.984 ± 0%    3.984 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5                                           3.763 ± 0%    3.763 ± 0%       ~ (p=1.000 n=10)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5                                                   3.904 ± 0%    3.904 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5                                289.5m ± 0%   289.5m ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5    15.71 ± 0%    15.71 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{status=error}_|_rate()/5                                                          3.905 ± 0%    3.904 ± 0%       ~ (p=0.070 n=10)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5)/5                                   6.663 ± 0%    6.663 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code)/5                6.913 ± 0%    6.913 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_histogram_over_time(duration)/5                                               6.663 ± 0%    6.663 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code)/5                          6.913 ± 0%    6.913 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code)/5                          6.913 ± 0%    6.913 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code)/5                          6.913 ± 0%    6.913 ± 0%       ~ (p=1.000 n=10)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error})/5                                        43.85 ± 0%    43.85 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                                                                   5.385         5.385       -0.00%
¹ all samples are equal

                                                                                                       │  before.txt   │            after1000.txt             │
                                                                                                       │   spans/op    │  spans/op    vs base                 │
BackendBlockQueryRange/{}_|_rate()/5                                                                     623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code)/5                                          623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5                                          623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5                                                  623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5                                 0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5    0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{status=error}_|_rate()/5                                                         1.754k ± 0%     1.754k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5)/5                                  623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code)/5               623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_histogram_over_time(duration)/5                                              623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code)/5                         623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code)/5                         623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code)/5                         623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error})/5                                       623.2k ± 0%     623.2k ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                                                                              ²                +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                                                                                       │   before.txt   │             after1000.txt             │
                                                                                                       │    spans/s     │   spans/s     vs base                 │
BackendBlockQueryRange/{}_|_rate()/5                                                                     3.093M ± 10%     3.027M ± 14%       ~ (p=0.971 n=10)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code)/5                                          1.404M ±  2%     1.397M ±  3%       ~ (p=0.739 n=10)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5                                          2.475M ±  1%     2.526M ±  6%       ~ (p=0.123 n=10)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5                                                  1.438M ±  2%     1.414M ±  8%       ~ (p=0.529 n=10)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5                                 0.000 ±  0%      0.000 ±  0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5    0.000 ±  0%      0.000 ±  0%       ~ (p=1.000 n=10) ¹
BackendBlockQueryRange/{status=error}_|_rate()/5                                                         49.38k ±  8%     52.30k ±  3%       ~ (p=0.063 n=10)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5)/5                                  1.746M ±  4%     1.710M ±  3%       ~ (p=0.190 n=10)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code)/5               1.046M ±  4%     1.037M ±  7%       ~ (p=0.631 n=10)
BackendBlockQueryRange/{}_|_histogram_over_time(duration)/5                                              1.723M ± 12%     1.704M ±  3%       ~ (p=0.912 n=10)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code)/5                         1.180M ±  2%     1.199M ±  1%       ~ (p=0.105 n=10)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code)/5                         1.189M ±  4%     1.200M ±  3%       ~ (p=0.247 n=10)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code)/5                         1.180M ±  3%     1.177M ±  3%       ~ (p=0.912 n=10)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error})/5                                       68.21k ±  2%     69.61k ±  3%       ~ (p=0.075 n=10)
geomean                                                                                                               ²                 +0.26%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

@knylander-grafana knylander-grafana left a comment

Thank you for adding docs.

@javiermolinar javiermolinar left a comment

How does the metric query look now? Are the series evenly distributed? I'm asking because we have an issue with exemplars, where we enforce a similar limit, and they appear to be skewed.

ie-pham commented Mar 27, 2025

Benchmarks show no noticeable difference with the added call to Length(); the results match the tables in the PR description above.


require.NoError(t, tempo.WaitSumMetricsWithOptions(e2e.GreaterOrEqual(1), []string{"tempo_metrics_generator_processor_local_blocks_spans_total"}, e2e.WaitMissingMetrics))
require.NoError(t, tempo.WaitSumMetricsWithOptions(e2e.GreaterOrEqual(1), []string{"tempo_metrics_generator_processor_local_blocks_cut_blocks"}, e2e.WaitMissingMetrics))

query := "{} | rate() by (span:id)"
Contributor

I love that this is our test case 💯

@@ -43,6 +43,11 @@ func newQueryInstantStreamingGRPCHandler(cfg Config, next pipeline.AsyncRoundTri
End: req.End,
Step: req.End - req.Start,
}
// if a limit is being enforced, honor the request if it is less than the limit
// else set it to max limit
if cfg.Metrics.Sharder.MaxResponseSeries > 0 && (qr.MaxSeries > uint32(cfg.Metrics.Sharder.MaxResponseSeries) || qr.MaxSeries == 0) {
Contributor

Do the handlers need to do this too, when the sharder is already doing it? It looks like doing it in one place, in the sharder, should work.

Contributor Author

The handler needs it so it can pass it to the combiner.

Contributor

Ah ok. Let's still consolidate it into one function?

Collaborator

I think it only needs to be done here? The sharder can assume that the value is set correctly?

Although, looking at search, it looks like we repeat the logic twice: once in the sharder and once in the handler:

https://github.com/grafana/tempo/blob/main/modules/frontend/search_sharder.go#L82
https://github.com/grafana/tempo/blob/main/modules/frontend/search_handlers.go#L128

It seems like we should establish a better pattern and only do this once, but I agree with Marty that consolidating the logic in one func is the right choice.
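
For illustration, one way that consolidation could look, sketched with a hypothetical helper (clampMaxSeries is not the PR's actual code; it only mirrors the clamp logic shown in the diff above):

package main

import "fmt"

// clampMaxSeries applies the configured limit in one place so the handler and
// the sharder don't each repeat the check. requested is the maxSeries value on
// the incoming request; configured is the max_response_series setting.
func clampMaxSeries(requested uint32, configured int) uint32 {
	if configured <= 0 {
		// Limit disabled: honor whatever the request asked for.
		return requested
	}
	if requested == 0 || requested > uint32(configured) {
		// No value requested, or the request exceeds the limit: clamp to the limit.
		return uint32(configured)
	}
	return requested
}

func main() {
	fmt.Println(clampMaxSeries(0, 1000))    // 1000: unset request takes the limit
	fmt.Println(clampMaxSeries(500, 1000))  // 500: request below the limit is honored
	fmt.Println(clampMaxSeries(5000, 1000)) // 1000: request above the limit is clamped
	fmt.Println(clampMaxSeries(5000, 0))    // 5000: limit disabled
}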

}
jobEval.ObserveSeries(resp)
seriesCount := jobEval.Length()
syncAtomic.AddInt32(&totalSeriesCount, int32(seriesCount))
Contributor

Can you clarify this part a bit? The atomic is used to skip remaining blocks when we exceed the limit, but should it be a Set instead of an Add? It looks like if we have 100 complete blocks all returning the same 10 series, it can still hit the (default) 1000 limit.

Contributor Author

Ah, nice catch. The atomic should just be used to keep track of the raw eval for the WAL and head blocks. jobEval.Length() is already tracking the series for the completed blocks. The check should now be:

raw series count + jobEval.Length() > max series

Contributor

Sorry, I think I am missing something, or there is still a change needed. If there are 2 WAL blocks and they each result in 501 series, it would add to the atomic twice and consider the limit exceeded, even if it's the same series. Likewise, if the WAL rawEval and the complete-block jobEval each have (the same) 501 series, totalRawResultsSeriesCount + jobEval.Length would be considered above the limit. I think what we can do is run and check them both independently, skipping blocks of each type if the matching eval is exceeded. Instance.go already checks again when doing the final combine. Would it be helpful to add func (m *MetricsEvaluator) Length() so we don't need the atomics at all?

Contributor Author

Updated to use an atomic bool, maxSeriesReached, which is set to true if either rawEval.Length() > maxSeries or jobEval.Length() > maxSeries.
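
For illustration, a rough sketch of that shape (the evaluator type, block loop, and names below are hypothetical stand-ins, not the processor's actual code):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// evaluator stands in for the raw/job evaluators; Length() reports the number
// of distinct series observed so far.
type evaluator struct {
	mu     sync.Mutex
	series map[string]struct{}
}

func (e *evaluator) observe(labels string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.series[labels] = struct{}{}
}

func (e *evaluator) Length() int {
	e.mu.Lock()
	defer e.mu.Unlock()
	return len(e.series)
}

func main() {
	const maxSeries = 2
	var maxSeriesReached atomic.Bool

	rawEval := &evaluator{series: map[string]struct{}{}}
	jobEval := &evaluator{series: map[string]struct{}{}}

	blocks := [][]string{{"a", "b"}, {"b", "c"}, {"d"}}
	for _, block := range blocks {
		// Skip any remaining blocks once either evaluator has hit the limit.
		if maxSeriesReached.Load() {
			break
		}
		for _, labels := range block {
			jobEval.observe(labels)
		}
		if rawEval.Length() > maxSeries || jobEval.Length() > maxSeries {
			maxSeriesReached.Store(true)
		}
	}

	fmt.Println(jobEval.Length(), maxSeriesReached.Load()) // prints: 3 true
}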

@ie-pham ie-pham merged commit 895f187 into grafana:main Apr 11, 2025
15 checks passed

Successfully merging this pull request may close these issues.

Limit series produced by TraceQL Metrics