enforce max series for metrics queries #4525
Conversation
Thank you for adding docs.
force-pushed from 604930d to 2c986ed
force-pushed from 4aa4ee3 to a073454
How does the metric query look now? Are the series evenly distributed? I'm asking because we have an issue with exemplars where we enforce a similar limit, and they appear to be skewed
force-pushed from ba3fb50 to b0989bc
Benchmarks show no noticeable difference with the added call to Length()/length().
require.NoError(t, tempo.WaitSumMetricsWithOptions(e2e.GreaterOrEqual(1), []string{"tempo_metrics_generator_processor_local_blocks_spans_total"}, e2e.WaitMissingMetrics))
require.NoError(t, tempo.WaitSumMetricsWithOptions(e2e.GreaterOrEqual(1), []string{"tempo_metrics_generator_processor_local_blocks_cut_blocks"}, e2e.WaitMissingMetrics))

query := "{} | rate() by (span:id)"
I love that this is our test case 💯
@@ -43,6 +43,11 @@ func newQueryInstantStreamingGRPCHandler(cfg Config, next pipeline.AsyncRoundTri
		End:  req.End,
		Step: req.End - req.Start,
	}
	// if a limit is being enforced, honor the request if it is less than the limit
	// else set it to max limit
	if cfg.Metrics.Sharder.MaxResponseSeries > 0 && (qr.MaxSeries > uint32(cfg.Metrics.Sharder.MaxResponseSeries) || qr.MaxSeries == 0) {
Do the handlers need to do this too, when the sharder is already doing it? It looks like doing in the 1 place in the sharder should work.
the handler needs it to pass it to the combiner
Ah ok. Let's still consolidate it to 1 function?
I think it only needs to be done here? The sharder can assume that the value is set correctly?
Although, looking at search, it looks like we repeat the logic 2x: once in the sharder and once in the handler:
https://github.com/grafana/tempo/blob/main/modules/frontend/search_sharder.go#L82
https://github.com/grafana/tempo/blob/main/modules/frontend/search_handlers.go#L128
It seems like we should establish a better pattern and only do this once, but I agree with Marty that consolidating the logic in one func is the right choice.
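For illustration, a minimal sketch of what a single shared helper could look like; the name clampMaxSeries and where it would live are assumptions, not code from this PR:

```go
// clampMaxSeries returns the effective series limit for a request.
// If the configured limit is enabled (> 0) and the request either sets no
// limit or asks for more than the configured limit, the configured limit wins.
// Hypothetical helper for illustration only.
func clampMaxSeries(requested uint32, configured int) uint32 {
	if configured <= 0 {
		return requested // limit disabled: honor whatever the request asked for
	}
	if requested == 0 || requested > uint32(configured) {
		return uint32(configured)
	}
	return requested
}
```

Both the handler and the sharder could then call something like qr.MaxSeries = clampMaxSeries(qr.MaxSeries, cfg.Metrics.Sharder.MaxResponseSeries) instead of repeating the condition.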
	}
	jobEval.ObserveSeries(resp)
	seriesCount := jobEval.Length()
	syncAtomic.AddInt32(&totalSeriesCount, int32(seriesCount))
Can you clarify this part a bit - the atomic is used to skip remaining blocks when we exceed the limit, but should it be a Set instead of an Add? It looks like if we have 100 complete blocks all returning the same 10 series, it can still hit the (default) 1000 limit.
Ah, nice catch. The atomic should just be used to keep track of the raw eval for WAL and head blocks. The jobEval.Length() is already tracking the series for the completed blocks. The check should now be: raw series count + jobEval.Length() > max series.
Sorry, I think I am missing something or there is still a change needed. If there are 2 WAL blocks and they each result in 501 series, it would add to the atomic twice and consider the limit exceeded, even if it's the same series. Likewise, if the WAL rawEval and the complete jobEval each have (the same) 501 series, totalRawResultsSeriesCount + jobEval.Length() would be considered above the limit. I think what we can do is run and check them both independently, skipping blocks of each type if the matching eval is exceeded. Instance.go already checks again when doing the final combine. Would it be helpful to add func (m *MetricsEvaluator) Length() so we don't need the atomics at all?
Updated to use an atomic bool maxSeriesReached, which will be set to true if either rawEval.Length() > maxSeries or jobEval.Length() > maxSeries.
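As a rough sketch of that atomic-bool approach, assuming evaluators that expose a Length() method; the identifiers lengther and checkSeriesLimit are illustrative, not the PR's exact code:

```go
import "sync/atomic"

// maxSeriesReached is flipped once either evaluator has produced more
// series than allowed; later blocks can check it and skip work.
var maxSeriesReached atomic.Bool

// lengther is a stand-in for the evaluators, which expose Length().
type lengther interface{ Length() int }

// checkSeriesLimit records whether either evaluator exceeded the limit.
// A maxSeries of 0 means the limit is disabled.
func checkSeriesLimit(rawEval, jobEval lengther, maxSeries int) {
	if maxSeries <= 0 {
		return
	}
	if rawEval.Length() > maxSeries || jobEval.Length() > maxSeries {
		maxSeriesReached.Store(true)
	}
}
```

Block processing would then bail out early with something like if maxSeriesReached.Load() { return } before evaluating the next block.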
force-pushed from ff1490a to b3557a5
What this PR does: Adds config to enforce the maximum number of time series returned in a metrics query. This is enforced at 4 levels: front-end combiner, querier combiner, metrics-generator local blocks, and metrics evaluation. The configuration is set in the query-frontend config and is passed to all levels as maxSeries in the QueryRangeRequest proto.
New config: max_response_series (default 1000). Setting the value to 0 will disable this feature.
Approach: Keep track of the number of series by calling Length() until the limit is reached. Whatever series were generated up to this point will be truncated at the limit and returned as partial results. This may mean that the partial response will contain values that are inaccurate.
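As a sketch of that truncation idea, a combiner could stop admitting new series once the limit is hit and flag the result as partial; the combineSeries function and its map-of-samples shape below are hypothetical, not this PR's actual combiner types:

```go
// combineSeries merges incoming series into the accumulated result, refusing
// to add new series once maxSeries distinct series exist (maxSeries <= 0
// disables the limit). Already-admitted series keep receiving data, so a
// truncated result is partial and its values may be inaccurate, matching the
// caveat above.
func combineSeries(acc, incoming map[string][]float64, maxSeries int) (truncated bool) {
	for key, samples := range incoming {
		if _, ok := acc[key]; !ok && maxSeries > 0 && len(acc) >= maxSeries {
			truncated = true // this series is beyond the limit and is dropped
			continue
		}
		acc[key] = append(acc[key], samples...)
	}
	return truncated
}
```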
Which issue(s) this PR fixes:
Fixes #4219
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
benchmarks