Migrate OTLP endpoint to PRW 2.0 for ingestion #16784
Conversation
I implemented the optimization @bwplotka suggested in getOrCreateTimeSeries, which brought the overhead down from +50% CPU usage to +35%.
Given that we don't need to save on throughput internally, I'd suggest solving metadata differently.
I mean, the other things that Remote-Write 2.0 brings are native histograms with custom buckets (NHCB) and the created timestamp.
@carrieedwards already added support for NHCB over Remote-Write 1.0 internally: #15850.
I'm starting work on adding support for the created timestamp and I plan to do the same: add the field internally to Remote-Write 1.0. It's not done yet, and only in Mimir.
WDYT?
If we are open to adding "internal" fields to protocols, we could consider adding labels directly to writev2.Timeseries... I prototyped that, and it reduces allocations by another 20%. I'm more seriously considering not translating to PRW at all and writing straight to the appender... It seems like this causes way more problems than it solves, and I don't love adding fields to protocols that are meant to be used externally.
I think I might have figured out why there is such a high performance cost. isSameMetric has a bug where it returns true as long as the labels are the same length (see storage/remote/otlptranslator/prometheusremotewrite/metrics_to_prw.go, lines 277 to 281 at 74aca68).
It is comparing elements in ts.Labels to themselves, rather than to the input labels.
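For illustration, here is a minimal sketch of what the corrected comparison would look like, based purely on the description above (this is not the exact upstream code, just the shape of the fix):

```go
import "github.com/prometheus/prometheus/prompb"

// isSameMetric reports whether the existing time series has exactly the same
// label set as the incoming labels. The bug described above compared
// ts.Labels[i] against ts.Labels[i] (itself), so any label set of equal
// length was treated as a match; the fix is to compare against lbls[i].
func isSameMetric(ts *prompb.TimeSeries, lbls []prompb.Label) bool {
	if len(ts.Labels) != len(lbls) {
		return false
	}
	for i := range ts.Labels {
		if ts.Labels[i].Name != lbls[i].Name || ts.Labels[i].Value != lbls[i].Value {
			return false
		}
	}
	return true
}
```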
FYI, added prometheus/common#801.
@krajorama are we sure resurrecting the dead PRW1 is a good idea? What's blocking PRW2?
I would say this code has to go through the PRW2 path or go native; there's little point in sticking to the old path forever.
I don't want to resurrect PRW1. However, I also don't want to see +35% CPU in the OTLP endpoint. I'm working on a prototype that uses neither PRW1 nor PRW2 in the OTLP endpoint, and instead calls the storage.Appender interface directly.
Yes, the native way is the long-term solution; makes sense.
Was the bug fixed in open-telemetry/opentelemetry-collector-contrib#35763? Did we set up a mechanism to be informed of bugfixes in these two copies of the code?
I did a little thought experiment to see what it would take to use the Appender interface: #16827. It doesn't look terrible at first glance. Certainly if the Appender is writing directly to TSDB, I can imagine it being pretty fast.

In Mimir we don't write to TSDB directly; we use PRW 1.0 for further processing and sending over to storage. That means we'd need to take the input to the Appender interface and convert it to PRW 1.0. But the Appender in Mimir takes …

Also, the Appender interface allows random access: I could append a float sample to series "A", write 1000 different series, then write the exemplar for series "A". In fact, adding the exemplars for classic histograms does this in the code: it writes the buckets first, then the exemplars for those buckets. This random access would be insane to implement in an Appender that just spits out PRW1. So at second glance, I'm not convinced this is the right direction long term.
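For reference, an abridged sketch of Prometheus's storage.Appender interface (the method set is trimmed here; the real definition in storage/interface.go also covers histograms, metadata, and created-timestamp zero samples). The SeriesRef returned by Append is what makes the random-access pattern above cheap against TSDB, and awkward for a backend that only emits a PRW 1.0 request:

```go
import (
	"github.com/prometheus/prometheus/model/exemplar"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/storage"
)

// Abridged sketch of storage.Appender; see storage/interface.go for the full interface.
type Appender interface {
	// Append adds a float sample; ref 0 means "unknown series", and the
	// returned ref can be reused later to skip label lookups.
	Append(ref storage.SeriesRef, l labels.Labels, t int64, v float64) (storage.SeriesRef, error)
	// AppendExemplar can target any previously appended series, in any order,
	// which is the random-access property discussed above.
	AppendExemplar(ref storage.SeriesRef, l labels.Labels, e exemplar.Exemplar) (storage.SeriesRef, error)
	Commit() error
	Rollback() error
}
```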
When I was trying to do something comparable, I extended … It was in this (unmerged) PR: grafana/mimir#6979
But looking at the two converters side by side, there is a layer we could make into an interface. It would take map[string]string for labels and take sample + metadata + exemplar(s) in one go. That translates well to a TSDB Appender or to converting into some protocol, I would think. So maybe we could do that if PRW 2.0 doesn't work out for this use case?
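A hypothetical shape for that layer, just to make the idea concrete (all names and types below are illustrative, not something proposed in this PR):

```go
// MetricSink is a hypothetical interface for the layer described above: one
// call carries the label set plus the sample, its metadata, and its exemplars,
// so it could be backed either by a TSDB Appender or by a protocol encoder.
type MetricSink interface {
	AddSample(lbls map[string]string, timestampMs int64, value float64,
		meta Metadata, exemplars []Exemplar) error
}

// Metadata and Exemplar are placeholder types for this sketch only.
type Metadata struct {
	Type, Help, Unit string
}

type Exemplar struct {
	Labels      map[string]string
	Value       float64
	TimestampMs int64
}
```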
Prometheus stopped using maps for labels internally in 2016, as it is far cheaper to use a slice.
The reason to suggest it is that the conversion code uses it already.
It could be something else, as long as we don't need to unpack it for either branch: Appender or Remote Write.
That's pretty internal to that function, apparently aimed at detecting duplicates. Though this is questionable: labels need to be sorted anyway, so deferring that detection until after you've finished and sorted them will be much cheaper. Incidentally, I couldn't immediately see how labels were sorted in that code; it happens as a side effect of …
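A minimal sketch of the "detect duplicates after sorting" idea on a PRW label slice (the helper name is made up for illustration):

```go
import (
	"sort"

	"github.com/prometheus/prometheus/prompb"
)

// sortAndCheckDuplicates sorts labels by name and reports whether any name repeats.
// Once the slice is sorted (which is needed anyway), duplicate detection is a
// single linear pass over adjacent elements instead of a per-insert check.
func sortAndCheckDuplicates(lbls []prompb.Label) ([]prompb.Label, bool) {
	sort.Slice(lbls, func(i, j int) bool { return lbls[i].Name < lbls[j].Name })
	for i := 1; i < len(lbls); i++ {
		if lbls[i].Name == lbls[i-1].Name {
			return lbls, true // duplicate label name found
		}
	}
	return lbls, false
}
```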
I've also been working on a migration PR (instead of a re-write): main...dashpole:prometheus:otlp_to_labels. I haven't updated the unit tests in the translator package, but it passes the tests in write_test.go.
Yeah, that is pretty much what the OTel prometheusreceiver has to do, and it is pretty complex.
Thanks for the discussions and experiments! Just to summarize the current discussion (let me know if I'm wrong): …
Unblocking questions, then...

A) It feels like it's still nice to pursue OTLP -> Appender long term; how could we unblock this? Would it be helpful for Mimir to vendor the OTLP -> PRW1 path mid-term to unblock Prometheus upstream? Funny enough, related to (B), I already "had" to do something like this to support efficient CT appends...

Also, if we do (B), how breaking do we want it to be... do we change it in place, or do we make life easier for Thanos, Cortex, and Mimir by maintaining both v1 and v2 paths for key flows...
Yes, there's always the possibility to deviate our Mimir fork more for some time. I'd like to do a POC where our Appender that converts to RW1 would return the index+1 of the time series in the slice of time series in the RW1 request as the reference. Of course, 0 means this is a new time series and we append. I'm guessing that would solve the issue of non-atomic Append calls. I have to think about the stringlabels and probably discuss with @bboreham. I could imagine a trick where we fake things by putting the stringlabel data into a special label: …
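A rough sketch of that POC idea, under stated assumptions (hypothetical type and only the float-sample path shown; hash collisions are ignored): the Appender buffers into a PRW 1.0 WriteRequest and hands back slice index + 1 as the SeriesRef, with 0 reserved for "new series".

```go
import (
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/prompb"
	"github.com/prometheus/prometheus/storage"
)

// rw1Appender is a hypothetical adapter: it implements the Append method of
// storage.Appender by buffering samples into a Remote-Write 1.0 request.
type rw1Appender struct {
	req  prompb.WriteRequest
	refs map[uint64]storage.SeriesRef // label hash -> index in req.Timeseries + 1
}

func (a *rw1Appender) Append(ref storage.SeriesRef, l labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
	if ref == 0 {
		if r, ok := a.refs[l.Hash()]; ok {
			ref = r
		} else {
			// New series: convert the labels and remember its position.
			lset := make([]prompb.Label, 0, l.Len())
			l.Range(func(lb labels.Label) {
				lset = append(lset, prompb.Label{Name: lb.Name, Value: lb.Value})
			})
			a.req.Timeseries = append(a.req.Timeseries, prompb.TimeSeries{Labels: lset})
			ref = storage.SeriesRef(len(a.req.Timeseries)) // index + 1, so 0 stays "unknown"
			a.refs[l.Hash()] = ref
		}
	}
	// Later calls with the same ref land on the same RW1 time series,
	// which is what makes out-of-order exemplar appends tractable.
	ts := &a.req.Timeseries[ref-1]
	ts.Samples = append(ts.Samples, prompb.Sample{Value: v, Timestamp: t})
	return ref, nil
}
```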
I think that makes sense. It actually cleans up the translator code to do that too. I prototyped it in 97ae3e1#diff-3d1ac744df30ff13a5bc3f2fe4dd9a735714cdda480b0f9520e05be97ec3bc8aR37
Closing in favor of #16951.
Part of #16610. See #16610 (comment) for alternatives.
The current OTLP endpoint translates OTLP to PRW 1.0 and then appends the metrics to the TSDB using the PRW 1.0 ingestion path. This PR migrates the translation logic to instead produce PRW 2.0 and use the PRW 2.0 ingestion path.
Currently, the PRW 1.0 endpoint just ignores metadata, so we are dropping type, help, and unit information. Migrating to 2.0 will fix that. It will also allow the OTLP endpoint to use the type and unit feature implementation without re-implementing it in the translation layer.
This PR can be largely summarized as: calling writeV2 instead of write.
The existing e2e tests in write_test.go didn't need any changes, which helps to show that this doesn't change any behavior.
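To make the destination format concrete, here is a rough, hand-built example of the PRW 2.0 (writev2) request shape the translation now targets, including the per-series metadata that the 1.0 path was dropping. The metric, values, and symbol layout are made up for illustration; only the request structure reflects the writev2 protocol.

```go
import (
	writev2 "github.com/prometheus/prometheus/prompb/io/prometheus/write/v2"
)

// A minimal PRW 2.0 request for one counter series
// (__name__="http_requests_total", job="api") with type, help, and unit metadata.
var req = &writev2.Request{
	// All strings live in one symbols table; index 0 is reserved for "".
	Symbols: []string{
		"",                          // 0
		"__name__",                  // 1
		"http_requests_total",       // 2
		"job",                       // 3
		"api",                       // 4
		"Total HTTP requests seen.", // 5 (help)
		"{requests}",                // 6 (unit)
	},
	Timeseries: []writev2.TimeSeries{{
		// Label name/value pairs as pairs of symbol references:
		// __name__=http_requests_total, job=api.
		LabelsRefs: []uint32{1, 2, 3, 4},
		Samples:    []writev2.Sample{{Value: 42, Timestamp: 1700000000000}},
		Metadata: writev2.Metadata{
			Type:    writev2.Metadata_METRIC_TYPE_COUNTER,
			HelpRef: 5,
			UnitRef: 6,
		},
	}},
}
```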