
Conversation

dashpole
Contributor

@dashpole dashpole commented Jun 26, 2025

Part of #16610. See #16610 (comment) for alternatives.

The current OTLP endpoint translates OTLP to PRW 1.0, and then appends the metrics to the TSDB using the PRW 1.0 ingestion path. This PR migrates the translation logic to produce PRW 2.0 instead and to use the PRW 2.0 ingestion path.

Currently, the PRW 1.0 endpoint just ignores metadata, so we are dropping type, help, and unit information. Migrating to 2.0 will fix that. It will also allow the OTLP endpoint to use the type and unit feature implementation without re-implementing it in the translation layer.

This PR can be largely summarized as:

  • Rename prompb types to writev2 types. Those are split into their own commits.
  • Replace prompb.Label with labels.Label, and add support for symbolization (sketched below).
  • Metadata is attached to each TimeSeries, instead of being stored separately. Metadata is added to all unit tests.
  • Migrate the OTLP write handler to use writeV2 instead of write.
  • Small fixes.
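
For context on the symbolization item above: PRW 2.0 interns each label name and value once in a shared symbols table, and every TimeSeries carries only integer references into it. A minimal, hypothetical sketch of that idea (not the PR's exact code; it assumes labels from github.com/prometheus/prometheus/model/labels):

// symbolTable interns strings and hands out stable indices, the way a PRW 2.0
// request shares a single Symbols slice across all TimeSeries.
type symbolTable struct {
	symbols []string
	index   map[string]uint32
}

func newSymbolTable() *symbolTable {
	// Index 0 is conventionally reserved for the empty string in PRW 2.0.
	return &symbolTable{symbols: []string{""}, index: map[string]uint32{"": 0}}
}

func (t *symbolTable) symbolize(s string) uint32 {
	if ref, ok := t.index[s]; ok {
		return ref
	}
	ref := uint32(len(t.symbols))
	t.symbols = append(t.symbols, s)
	t.index[s] = ref
	return ref
}

// symbolizeLabels turns a label set into alternating name/value references,
// which is how a series refers back to the shared symbols table.
func (t *symbolTable) symbolizeLabels(ls labels.Labels) []uint32 {
	refs := make([]uint32, 0, 2*ls.Len())
	ls.Range(func(l labels.Label) {
		refs = append(refs, t.symbolize(l.Name), t.symbolize(l.Value))
	})
	return refs
}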

The existing e2e tests in write_test.go didn't need any changes, which helps to show that this doesn't change any behavior.

dashpole added 23 commits June 26, 2025 14:16
dashpole added 2 commits June 26, 2025 17:39
@dashpole dashpole marked this pull request as ready for review June 26, 2025 17:54
@dashpole dashpole requested a review from aknuds1 as a code owner June 26, 2025 17:54
@krajorama krajorama self-requested a review July 2, 2025 08:32
Member

@krajorama krajorama left a comment

I implemented the optimization @bwplotka suggested in getOrCreateTimeSeries, which improved it from +50% CPU usage to +35%.

Given that we don't need to save on throughput internally, I'd suggest solving metadata differently.

I mean the other things that Remote-Write 2.0 brings are native histograms with custom buckets (NHCB) and created timestamp.

@carrieedwards already added support for NHCB over Remote-Write 1.0 internally: #15850.

I'm starting work on adding support for created timestamp, and I plan to do the same: add the field internally to Remote-Write 1.0. Not done yet, and only in Mimir.

WDYT?

@dashpole
Contributor Author

dashpole commented Jul 2, 2025

If we are open to adding "internal" fields to protocols, we could consider adding labels directly to the writev2.Timeseries... I prototyped that, and it reduces allocations by another 20%.

I'm more seriously considering not translating to PRW at all, and writing straight to the appender... It seems like it is causing way more problems than it solves, and I don't love adding fields to protocols that are meant to be used externally.

@dashpole
Contributor Author

dashpole commented Jul 2, 2025

I think I might have figured out why there is such a high performance cost. isSameMetric has a bug where it returns true as long as the labels are the same length.

for i, l := range ts.Labels {
	if l.Name != ts.Labels[i].Name || l.Value != ts.Labels[i].Value {
		return false
	}
}

This is comparing elements in ts.Labels to themselves, rather than to the input labels.
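
A corrected version would compare against the incoming labels instead; roughly (a sketch, assuming the function receives the incoming label set as lbls):

for i, l := range ts.Labels {
	// Compare the stored series labels against the incoming labels (lbls),
	// not against themselves.
	if l.Name != lbls[i].Name || l.Value != lbls[i].Value {
		return false
	}
}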

bwplotka added a commit to prometheus/common that referenced this pull request Jul 3, 2025
Useful to use e.g. in prometheus/prometheus#16784

Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka
Member

bwplotka commented Jul 3, 2025

FYI added prometheus/common#801

@bwplotka
Member

bwplotka commented Jul 3, 2025

@krajorama are we sure resurrecting dead PRW1 is a good idea? What's blocking on PRW2?

#16784 (review)

@bwplotka
Member

bwplotka commented Jul 3, 2025

I would say this code has to go through the PRW2 path or natively; there's little point in sticking to the old path forever.

@krajorama
Member

@krajorama are we sure resurrecting dead PRW1 is a good idea? What's blocking on PRW2?

#16784 (review)

I don't want to resurrect PRW1.

However, I don't want to see +35% CPU in the OTLP endpoint.

I'm working on a prototype that uses neither PRW1 nor PRW2 in the OTLP endpoint, but rather calls the storage.Appender interface directly.

@bwplotka
Member

bwplotka commented Jul 3, 2025

Yes, the native way is the long-term solution, makes sense.

@bboreham
Member

bboreham commented Jul 3, 2025

isSameMetric has a bug

The bug was fixed in open-telemetry/opentelemetry-collector-contrib#35763?

Did we set up a mechanism to be informed of bugfixes in these two copies of the code?

@krajorama
Member

Did a little thought experiment, to see what it would take to use the Appender interface: #16827

Doesn't look terrible at first glance. Certainly if the Appender is writing directly to TSDB, I can imagine it being pretty fast.

In Mimir we don't write to TSDB directly; we use PRW 1.0 for further processing and sending over to storage. Which means we'd need to take the input to the Appender interface and convert it to PRW 1.0. But the Appender in Mimir takes stringlabels, and PRW 1.0 has a list of labels. So we'd be converting labels to stringlabels for the Appender and then back to a list of pairs of strings for PRW 1.0. That doesn't sound great.

Also the Appender interface allows for random access; that is, I could append a float sample to series "A", write 1000 different series, then write the exemplar for series "A". In fact, adding the exemplars for classic histograms does this in the code: it writes the buckets first, then the exemplars for those buckets. This random access would be insane to implement in an Appender that just spits out PRW1.
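
Concretely, the pattern looks roughly like this against the standard storage.Appender (a sketch; app, labelsA, otherSeries, ts, and the values are illustrative names):

// Append a float sample for series "A" and remember its reference.
refA, err := app.Append(0, labelsA, ts, valueA)
if err != nil {
	return err
}
// Append many unrelated series in between.
for _, s := range otherSeries {
	if _, err := app.Append(0, s.lbls, ts, s.value); err != nil {
		return err
	}
}
// Only now come back to series "A" and attach its exemplar, using refA.
if _, err := app.AppendExemplar(refA, labelsA, exemplar.Exemplar{Value: exemplarValue, Ts: ts}); err != nil {
	return err
}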

So at second glance I'm not convinced this is the right direction long term.

@bboreham
Member

bboreham commented Jul 3, 2025

the Appender in Mimir takes stringlabels

When I was trying to do something comparable, I extended Appender with a method taking a callback, so I could fetch the SeriesRef without having to construct a new labels.Labels except for brand-new series.

It was in this (unmerged) PR: grafana/mimir#6979

	GetRefFunc(hash uint64, cmp func(labels.Labels) bool) (SeriesRef, labels.Labels)
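
Usage would be along these lines (a sketch against that proposed method; labelsMatchIncoming, buildLabels, and incoming are hypothetical helpers):

// Look up an existing series by hash plus a comparison callback; only build a
// new labels.Labels if the series turns out to be brand new.
ref, lset := app.GetRefFunc(hash, func(l labels.Labels) bool {
	return labelsMatchIncoming(l, incoming) // compare without allocating labels.Labels
})
if ref == 0 {
	lset = buildLabels(incoming) // brand-new series: construct labels.Labels once
}
ref, err := app.Append(ref, lset, ts, value)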

@krajorama
Member

So at second glance I'm not convinced this is the right direction long term.

But looking at the two converters side by side, there is a layer we could make into an interface. It would take map[string]string for labels and take sample + metadata + exemplar(s) in one go. That would translate well to a TSDB Appender or to converting into some protocol, I would think.

So maybe we could do that if PRW2.0 doesn't work out for this use case?

@bboreham
Member

bboreham commented Jul 3, 2025

It would take map[string]string for labels

Prometheus stopped using maps for labels internally in 2016, as it is far cheaper to use a slice.

@krajorama
Member

It would take map[string]string for labels

Prometheus stopped using maps for labels internally in 2016, as it is far cheaper to use a slice.

The reason to suggest it is that the conversion code uses it already:

l := make(map[string]string, maxLabelCount)

It could be something else, as long as we don't need to unpack it for either branch: Appender or Remote Write.

@bboreham
Member

bboreham commented Jul 4, 2025

the conversion code uses it already

That's pretty internal to that function, apparently aimed at detecting duplicates. Though this is questionable: labels need to be sorted anyway, so deferring that detection until after you've finished and sorted them would be much cheaper.

Incidentally, I couldn't immediately see how labels were sorted in that code; it happens as a side effect of timeSeriesSignature(), which should at least be commented.
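
For illustration, once the labels are sorted, duplicate detection is a single scan over adjacent entries (a sketch, not the converter's actual code):

// hasDuplicateLabelNames sorts the labels (which the conversion needs to do
// anyway) and then checks adjacent entries for a repeated name.
func hasDuplicateLabelNames(lbls []prompb.Label) bool {
	sort.Slice(lbls, func(i, j int) bool { return lbls[i].Name < lbls[j].Name })
	for i := 1; i < len(lbls); i++ {
		if lbls[i].Name == lbls[i-1].Name {
			return true
		}
	}
	return false
}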

@dashpole dashpole marked this pull request as draft July 7, 2025 19:13
@dashpole
Contributor Author

dashpole commented Jul 7, 2025

I've also been working on a migration PR (instead of a re-write): main...dashpole:prometheus:otlp_to_labels. I haven't updated the unit tests in the translator package, but it passes the tests in write_test.go.

Also the Appender interface allows for random access; that is, I could append a float sample to series "A", write 1000 different series, then write the exemplar for series "A". In fact, adding the exemplars for classic histograms does this in the code: it writes the buckets first, then the exemplars for those buckets. This random access would be insane to implement in an Appender that just spits out PRW1.

Yeah, that is pretty much what the OTel prometheusreceiver has to do, and it is pretty complex.

@bwplotka
Member

bwplotka commented Jul 8, 2025

Thanks for discussions and experiments!

Just to summarize the current discussion (let me know if I'm wrong):

  1. Prometheus has:
  • PRW1 -> Appender
  • PRW2 -> Appender
  • Scrape -> Appender
  • OTLP -> PRW1 -> Appender # Causes complexities; ideally it's OTLP -> Appender.
  2. Moving to Appender looks like the best "native" option, however it's not ideal for Mimir, because Mimir depends on the OTLP to PRW1 mechanism:
  • PRW1 -> Mimir
  • PRW2 -> PRW1 -> Mimir
  • OTLP -> PRW1 -> Mimir # Implementing OTLP -> Appender will not work with Mimir, and OTLP -> Appender -> PRW is tricky due to non-sequential appends.
  3. Also the OTel collector uses the Prometheus appender interface in some paths and has the same issue with the appender:
  • (prometheusreceiver) Scrape -> Appender -> OTel data model # Also tricky due to non-sequential appends.
  • (prometheusremotewritereceiver) PRW 1/2 -> OTel data model

Unblocking questions then...

A) It feels like it's still nice to pursue OTLP -> Appender long term; how could we unblock this? Would it be helpful for Mimir to vendor the OTLP -> PRW1 path mid-term to unblock Prometheus upstream?
B) Should we create an appender API (v2?) for "atomic" appends (series + float + ts + exemplar + type and unit + CT appends)? Am I right that all ingestion formats (other than the experimental metadata on PRW1 and the metadata-wal-records feature) would already use this? -- scrape, PRW, and OTLP.

Funnily enough, related to (B), I already "had" to do something like this to support efficient CT appends...

Also, if we do (B), how breaking do we want it to be... do we change it in place, or do we make life easier for Thanos, Cortex, and Mimir by maintaining both v1 and v2 paths for key flows...
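
For (B), such an "atomic" append could look very roughly like this (a hypothetical sketch, not an agreed design; it reuses the existing storage.SeriesRef, labels.Labels, metadata.Metadata, and exemplar.Exemplar types):

// Hypothetical "v2" appender: everything that belongs to one sample arrives in
// a single call, so implementations never have to stitch pieces together later.
type AtomicAppender interface {
	// AppendSample carries the series labels, created timestamp (ct), the
	// sample itself, its metadata (type, unit, help), and any exemplars.
	AppendSample(ref storage.SeriesRef, ls labels.Labels, meta metadata.Metadata,
		ct, t int64, v float64, es []exemplar.Exemplar) (storage.SeriesRef, error)
	Commit() error
	Rollback() error
}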

@krajorama @bboreham @dashpole

@krajorama
Member

krajorama commented Jul 9, 2025

Yes, there's always the possibility of deviating our Mimir fork more for some time.

I'd like to do a POC where our Appender that converts to RW1 would return, as the reference, the index+1 of the time series in the RW1 request's slice of time series. Of course, 0 means this is a new time series and we append.

I'm guessing that would solve the issue of non-atomic Append calls.
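
A hypothetical sketch of that idea (ignoring exemplars, histograms, and error handling; not Mimir's actual code):

// rw1Appender builds a PRW 1.0 WriteRequest and uses each series' position+1
// as the SeriesRef, so later calls for the same series can find it again.
type rw1Appender struct {
	req prompb.WriteRequest
}

func (a *rw1Appender) Append(ref storage.SeriesRef, ls labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
	if ref == 0 {
		// New series: copy the labels in and hand back index+1 as the reference.
		var pls []prompb.Label
		ls.Range(func(l labels.Label) {
			pls = append(pls, prompb.Label{Name: l.Name, Value: l.Value})
		})
		a.req.Timeseries = append(a.req.Timeseries, prompb.TimeSeries{Labels: pls})
		ref = storage.SeriesRef(len(a.req.Timeseries))
	}
	ts := &a.req.Timeseries[ref-1]
	ts.Samples = append(ts.Samples, prompb.Sample{Timestamp: t, Value: v})
	return ref, nil
}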

I have to think about the stringlabels and probably discuss with @bboreham. I could imagine a trick where we fake things by putting the stringlabels data into a special label, __stringlabels__, and also set some header or something to let the other side know how to unmarshal it.

@dashpole
Contributor Author

dashpole commented Jul 9, 2025

I think that makes sense. It actually cleans up the translator code to do that too. I prototyped it in 97ae3e1#diff-3d1ac744df30ff13a5bc3f2fe4dd9a735714cdda480b0f9520e05be97ec3bc8aR37

@dashpole
Contributor Author

Closing in favor of #16951

@dashpole dashpole closed this Jul 30, 2025