
Conversation

dashpole
Contributor

@dashpole dashpole commented Jun 26, 2025

Part of #16610. See #16610 (comment) for alternatives.

The current OTLP endpoint translates OTLP to PRW 1.0, and then appends the metrics to the TSDB using the PRW 1.0 ingestion path. This PR migrates the translation logic to produce PRW 2.0 instead and to use the PRW 2.0 ingestion path.

Currently, the PRW 1.0 endpoint just ignores metadata, so we are dropping type, help, and unit information. Migrating to 2.0 will fix that. It will also allow the OTLP endpoint to use the type and unit feature implementation without re-implementing it in the translation layer.

This PR can be largely summarized as:

  • Rename prompb types to writev2 types. Those are split into their own commits.
  • Replace prompb.Label with labels.Label, and add support for symbolization (sketched below).
  • Metadata is attached to each TimeSeries, instead of being stored separately. Metadata is added to all unit tests.
  • Migrate the OTLP write handler to use writeV2 instead of write.
  • Small fixes.
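
For context on the symbolization item above: PRW 2.0 interns each label name and value once in a shared symbols table, and every TimeSeries carries only integer references into it. A minimal, hypothetical sketch of that idea (not the PR's exact code; it assumes labels from github.com/prometheus/prometheus/model/labels):

// symbolTable interns strings and hands out stable indices, the way a PRW 2.0
// request shares a single Symbols slice across all TimeSeries.
type symbolTable struct {
	symbols []string
	index   map[string]uint32
}

func newSymbolTable() *symbolTable {
	// Index 0 is conventionally reserved for the empty string in PRW 2.0.
	return &symbolTable{symbols: []string{""}, index: map[string]uint32{"": 0}}
}

func (t *symbolTable) symbolize(s string) uint32 {
	if ref, ok := t.index[s]; ok {
		return ref
	}
	ref := uint32(len(t.symbols))
	t.symbols = append(t.symbols, s)
	t.index[s] = ref
	return ref
}

// symbolizeLabels turns a label set into alternating name/value references,
// which is how a series refers back to the shared symbols table.
func (t *symbolTable) symbolizeLabels(ls labels.Labels) []uint32 {
	refs := make([]uint32, 0, 2*ls.Len())
	ls.Range(func(l labels.Label) {
		refs = append(refs, t.symbolize(l.Name), t.symbolize(l.Value))
	})
	return refs
}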

The existing e2e tests in write_test.go didn't need any changes, which helps to show that this doesn't change any behavior.

dashpole added 23 commits June 26, 2025 14:16
dashpole added 2 commits June 26, 2025 17:39
@dashpole dashpole marked this pull request as ready for review June 26, 2025 17:54
@dashpole dashpole requested a review from aknuds1 as a code owner June 26, 2025 17:54
@krajorama krajorama self-requested a review July 2, 2025 08:32
Member

@krajorama krajorama left a comment

I implemented the optimization @bwplotka suggested in getOrCreateTimeSeries, which improved it from +50% CPU usage to +35%.

Given that we don't need to save on throughput internally, I'd suggest solving metadata differently.

I mean the other things that Remote-Write 2.0 brings are native histograms with custom buckets (NHCB) and created timestamp.

@carrieedwards already added support for NHCB over Remote-Write 1.0 internally: #15850.

I'm starting work on adding support for created timestamp, and I plan to do the same: add the field internally to Remote-Write 1.0. Not done yet, and only in Mimir.

WDYT?

@dashpole
Contributor Author

dashpole commented Jul 2, 2025

If we are open to adding "internal" fields to protocols, we could consider adding labels directly to the writev2.Timeseries... I prototyped that, and it reduces allocations by another 20%.

I'm more seriously considering not translating to PRW at all, and writing straight to the appender... It seems like it is causing way more problems than it solves, and I don't love adding fields to protocols that are meant to be used externally.

@dashpole
Contributor Author

dashpole commented Jul 2, 2025

I think I might have figured out why there is such a high performance cost. isSameMetric has a bug where it returns true as long as the labels are the same length.

for i, l := range ts.Labels {
	if l.Name != ts.Labels[i].Name || l.Value != ts.Labels[i].Value {
		return false
	}
}

This is comparing elements in ts.Labels to themselves, rather than to the input labels.
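
A corrected version would compare against the incoming labels instead; roughly (a sketch, assuming the function receives the incoming label set as lbls):

for i, l := range ts.Labels {
	// Compare the stored series labels against the incoming labels (lbls),
	// not against themselves.
	if l.Name != lbls[i].Name || l.Value != lbls[i].Value {
		return false
	}
}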

bwplotka added a commit to prometheus/common that referenced this pull request Jul 3, 2025
Useful to use e.g. in prometheus/prometheus#16784

Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka
Member

bwplotka commented Jul 3, 2025

FYI added prometheus/common#801

@bwplotka
Member

bwplotka commented Jul 3, 2025

@krajorama are we sure resurrecting dead PRW1 is a good idea? What's blocking on PRW2?

#16784 (review)

@bwplotka
Member

bwplotka commented Jul 3, 2025

I would say this code has to go through the PRW2 path or natively; there's little point in sticking to the old path forever.

@krajorama
Member

@krajorama are we sure resurrecting dead PRW1 is a good idea? What's blocking on PRW2?

#16784 (review)

I don't want to resurrect PRW1.

However, I don't want to see +35% CPU in the OTLP endpoint.

I'm working on a prototype that uses neither PRW1 nor PRW2 in the OTLP endpoint, but rather calls the storage.Appender interface directly.

@bwplotka
Member

bwplotka commented Jul 3, 2025

Yes, the native way is the long-term solution, makes sense.

@bboreham
Member

bboreham commented Jul 3, 2025

isSameMetric has a bug

The bug was fixed in open-telemetry/opentelemetry-collector-contrib#35763?

Did we set up a mechanism to be informed of bugfixes in these two copies of the code?

@krajorama
Member

Did a little thought experiment, to see what it would take to use the Appender interface: #16827

Doesn't look terrible at first glance. Certainly if the Appender is writing directly to TSDB, I can imagine it being pretty fast.

In Mimir we don't write to TSDB directly; we use PRW 1.0 for further processing and sending over to storage. Which means we'd need to take the input to the Appender interface and convert it to PRW 1.0. But the Appender in Mimir takes stringlabels, and PRW 1.0 has a list of labels. So we'd be converting labels to stringlabels for the Appender and then back to a list of pairs of strings for PRW 1.0. That doesn't sound great.

Also the Appender interface allows for random access; that is, I could append a float sample to series "A", write 1000 different series, then write the exemplar for series "A". In fact, adding the exemplars for classic histograms does this in the code: it writes the buckets first, then the exemplars for those buckets. This random access would be insane to implement in an Appender that just spits out PRW1.
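
Concretely, the pattern looks roughly like this against the standard storage.Appender (a sketch; app, labelsA, otherSeries, ts, and the values are illustrative names):

// Append a float sample for series "A" and remember its reference.
refA, err := app.Append(0, labelsA, ts, valueA)
if err != nil {
	return err
}
// Append many unrelated series in between.
for _, s := range otherSeries {
	if _, err := app.Append(0, s.lbls, ts, s.value); err != nil {
		return err
	}
}
// Only now come back to series "A" and attach its exemplar, using refA.
if _, err := app.AppendExemplar(refA, labelsA, exemplar.Exemplar{Value: exemplarValue, Ts: ts}); err != nil {
	return err
}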

So at second glance I'm not convinced this is the right direction long term.

@bboreham
Member

bboreham commented Jul 3, 2025

the Appender in Mimir takes stringlabels

When I was trying to do something comparable, I extended Appender with a method taking a callback, so I could fetch the SeriesRef without having to construct a new labels.Labels except for brand-new series.

It was in this (unmerged) PR: grafana/mimir#6979

	GetRefFunc(hash uint64, cmp func(labels.Labels) bool) (SeriesRef, labels.Labels)
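
Usage would be along these lines (a sketch against that proposed method; labelsMatchIncoming, buildLabels, and incoming are hypothetical helpers):

// Look up an existing series by hash plus a comparison callback; only build a
// new labels.Labels if the series turns out to be brand new.
ref, lset := app.GetRefFunc(hash, func(l labels.Labels) bool {
	return labelsMatchIncoming(l, incoming) // compare without allocating labels.Labels
})
if ref == 0 {
	lset = buildLabels(incoming) // brand-new series: construct labels.Labels once
}
ref, err := app.Append(ref, lset, ts, value)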

@krajorama
Member

So at second glance I'm not convinced this is the right direction long term.

But looking at the two converters side by side, there is a layer we could make into an interface. It would take map[string]string for labels and take sample + metadata + exemplar(s) in one go. That would translate well to a TSDB Appender or to converting into some protocol, I would think.

So maybe we could do that if PRW2.0 doesn't work out for this use case?

@bboreham
Member

bboreham commented Jul 3, 2025

It would take map[string]string for labels

Prometheus stopped using maps for labels internally in 2016, as it is far cheaper to use a slice.

@krajorama
Member

It would take map[string]string for labels

Prometheus stopped using maps for labels internally in 2016, as it is far cheaper to use a slice.

The reason to suggest it is that the conversion code uses it already:

l := make(map[string]string, maxLabelCount)

It could be something else, as long as we don't need to unpack it for either branch: Appender or Remote Write.

@bboreham
Member

bboreham commented Jul 4, 2025

the conversion code uses it already

That's pretty internal to that function, apparently aimed at detecting duplicates. Though this is questionable: labels need to be sorted anyway, so deferring that detection until after you've finished and sorted them would be much cheaper.

Incidentally, I couldn't immediately see how labels were sorted in that code; it happens as a side effect of timeSeriesSignature(), which should at least be commented.
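
For illustration, once the labels are sorted, duplicate detection is a single scan over adjacent entries (a sketch, not the converter's actual code):

// hasDuplicateLabelNames sorts the labels (which the conversion needs to do
// anyway) and then checks adjacent entries for a repeated name.
func hasDuplicateLabelNames(lbls []prompb.Label) bool {
	sort.Slice(lbls, func(i, j int) bool { return lbls[i].Name < lbls[j].Name })
	for i := 1; i < len(lbls); i++ {
		if lbls[i].Name == lbls[i-1].Name {
			return true
		}
	}
	return false
}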

@dashpole dashpole marked this pull request as draft July 7, 2025 19:13
@dashpole
Contributor Author

dashpole commented Jul 7, 2025

I've also been working on a migration PR (instead of a re-write): main...dashpole:prometheus:otlp_to_labels. I haven't updated the unit tests in the translator package, but it passes the tests in write_test.go.

Also the Appender interface allows for random access; that is, I could append a float sample to series "A", write 1000 different series, then write the exemplar for series "A". In fact, adding the exemplars for classic histograms does this in the code: it writes the buckets first, then the exemplars for those buckets. This random access would be insane to implement in an Appender that just spits out PRW1.

Yeah, that is pretty much what the OTel prometheusreceiver has to do, and it is pretty complex.

@bwplotka
Member

bwplotka commented Jul 8, 2025

Thanks for discussions and experiments!

Just to summarize the current discussion (let me know if I'm wrong):

  1. Prometheus has:
  • PRW1 -> Appender
  • PRW2 -> Appender
  • Scrape -> Appender
  • OTLP -> PRW1 -> Appender # Causes complexities; ideally it's OTLP -> Appender.
  2. Moving to Appender looks like the best "native" option, however it's not ideal for Mimir, because Mimir depends on the OTLP to PRW1 mechanism:
  • PRW1 -> Mimir
  • PRW2 -> PRW1 -> Mimir
  • OTLP -> PRW1 -> Mimir # Implementing OTLP -> Appender will not work with Mimir, and OTLP -> Appender -> PRW is tricky due to non-sequential appends.
  3. Also the OTel collector uses the Prometheus appender interface in some paths and has the same issue with the appender:
  • (prometheusreceiver) Scrape -> Appender -> OTel data model # Also tricky due to non-sequential appends.
  • (prometheusremotewritereceiver) PRW 1/2 -> OTel data model

Unblocking questions then...

A) It feels like it's still nice to pursue OTLP -> Appender long term; how could we unblock this? Would it be helpful for Mimir to vendor the OTLP -> PRW1 path mid-term to unblock Prometheus upstream?
B) Should we create an appender API (v2?) for "atomic" appends (series + float + ts + exemplar + type and unit + CT appends)? Am I right that all ingestion formats (other than the experimental metadata on PRW1 and the metadata-wal-records feature) would already use this? -- scrape, PRW, and OTLP.

Funnily enough, related to (B), I already "had" to do something like this to support efficient CT appends...

Also, if we do (B), how breaking do we want it to be... do we change it in place, or do we make life easier for Thanos, Cortex, and Mimir by maintaining both v1 and v2 paths for key flows...
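
For (B), such an "atomic" append could look very roughly like this (a hypothetical sketch, not an agreed design; it reuses the existing storage.SeriesRef, labels.Labels, metadata.Metadata, and exemplar.Exemplar types):

// Hypothetical "v2" appender: everything that belongs to one sample arrives in
// a single call, so implementations never have to stitch pieces together later.
type AtomicAppender interface {
	// AppendSample carries the series labels, created timestamp (ct), the
	// sample itself, its metadata (type, unit, help), and any exemplars.
	AppendSample(ref storage.SeriesRef, ls labels.Labels, meta metadata.Metadata,
		ct, t int64, v float64, es []exemplar.Exemplar) (storage.SeriesRef, error)
	Commit() error
	Rollback() error
}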

@krajorama @bboreham @dashpole

@krajorama
Member

krajorama commented Jul 9, 2025

Yes, there's always the possibility of deviating our Mimir fork more for some time.

I'd like to do a POC where our Appender that converts to RW1 would return, as the reference, the index+1 of the time series in the RW1 request's slice of time series. Of course, 0 means this is a new time series and we append.

I'm guessing that would solve the issue of non-atomic Append calls.
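
A hypothetical sketch of that idea (ignoring exemplars, histograms, and error handling; not Mimir's actual code):

// rw1Appender builds a PRW 1.0 WriteRequest and uses each series' position+1
// as the SeriesRef, so later calls for the same series can find it again.
type rw1Appender struct {
	req prompb.WriteRequest
}

func (a *rw1Appender) Append(ref storage.SeriesRef, ls labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
	if ref == 0 {
		// New series: copy the labels in and hand back index+1 as the reference.
		var pls []prompb.Label
		ls.Range(func(l labels.Label) {
			pls = append(pls, prompb.Label{Name: l.Name, Value: l.Value})
		})
		a.req.Timeseries = append(a.req.Timeseries, prompb.TimeSeries{Labels: pls})
		ref = storage.SeriesRef(len(a.req.Timeseries))
	}
	ts := &a.req.Timeseries[ref-1]
	ts.Samples = append(ts.Samples, prompb.Sample{Timestamp: t, Value: v})
	return ref, nil
}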

I have to think about the stringlabels and probably discuss with @bboreham. I could imagine a trick where we fake things by putting the stringlabels data into a special label, __stringlabels__, and also set some header or something to let the other side know how to unmarshal it.

@dashpole
Contributor Author

dashpole commented Jul 9, 2025

I think that makes sense. It actually cleans up the translator code to do that too. I prototyped it in 97ae3e1#diff-3d1ac744df30ff13a5bc3f2fe4dd9a735714cdda480b0f9520e05be97ec3bc8aR37

@dashpole
Contributor Author

Closing in favor of #16951

@dashpole dashpole closed this Jul 30, 2025