Skip to content

Conversation

ywwg
Copy link
Member

@ywwg ywwg commented Nov 14, 2024

Follow-on from #14943

Previously, the api was evaluating this regex to determine if the label
name was valid or not:

https://github.com/prometheus/common/blob/14bac55a992f7b83ab9d147a041e274606bdb607/model/labels.go#L94

However, I believe that the IsValid() function is what ought to be
used in the brave new utf8 era.

Before

$ curl localhost:9090/api/v1/label/host.name/values
{"status":"error","errorType":"bad_data","error":"invalid label name: \"host.name\""}

After

$ curl localhost:9090/api/v1/label/host.name/values
{"status":"success","data":["localhost"]}

It's very likely that I'm missing something here or you were already
planning to do this at some point but I just encountered this issue and
figured I'd give it a go.

@ywwg
Copy link
Member Author

ywwg commented Nov 14, 2024

windows error seems to be spurious

bwplotka
bwplotka previously approved these changes Nov 18, 2024
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM!

Should we do it in release-3.0 branch first (if yes, rebase on that branch), so we do 3.0.1? cc @jan--f

We need to put cap on how much we put into 3.0.x here. We will likely have multiple UTF-8 gaps.. I don't think it must go to 3.0.1, but I think it could (:

@ywwg
Copy link
Member Author

ywwg commented Nov 18, 2024

@jan--f said to commit to main and then backport as necessary. I can set up that PR

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good code-wise, but shouldn't there be an update to the documentation in https://github.com/prometheus/prometheus/blob/main/docs/querying/api.md ?

@jan--f jan--f added the backport-3.0 Fixes that need to be backported to release 3.0 label Nov 21, 2024
@beorn7
Copy link
Member

beorn7 commented Nov 21, 2024

FTR: I'm not sure if this should really be in v3.0.1. I would see it more as a missing feature than a real bug. (And I would like to get back to a regular release cadence ASAP, so that v3.1.0 is not far down the road.)
But if we want to release this in v3.0.1, then let's first rebase this PR on top of the v3.0 release branch.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for more comments. Documenting is hard…

In any case, before merging this, let's decide if this should go into v3.0.1 or v3.1.0, see other comment. (If need be, we need to rebase this on top of the release branch.)

The `data` section of the JSON response is a list of string label values.

This example queries for all label values for the `job` label:
This example queries for all label values for the `job_name` label:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for nitpicking again, but let's not use a "weird" label name in the first example. job_name is a weird because it's so common to have a job label. Why would somebody have job_name? If your intention is to have a label name with an underscore here, let' use one that is actually common like http_status_code (which could then be http.status_code in the example below to even allude to the different meaning of . and _ in OTel).

Comment on lines 451 to 452
For label names that contain characters that are not valid in the legacy
Prometheus character set (letters, numbers, underscores, and colons), they
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, you can use non-legacy characters here as long as they are valid as URL components, can't we?

So I guess http://localhost:9090/api/v1/label/job.name/values would actually work?

If that's true, we only need the U__ encoding for characters that don't work as URL components (in particular /).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that, but that might lead people to use HTML encoding: "job%20name" which is not supported. So while job.name would work, I'd prefer to make the reasoning simpler and just ask that people use encoding when the name is non-legacy-compatible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By HTML encoding, you mean percent encoding, also known as URL encoding, don't you? I.e. https://en.wikipedia.org/wiki/Percent-encoding

I would expect that URL encoding is in fact supported, because that should just be dealt with by the Go HTTP libraries and the Prometheus code wouldn't even notice.

It might very well possible that we only need the U__ encoding for a / in the label name. (At least that was the case long ago for label values as they show up in URLs for pushing to the PGW. Of course, back then there was no U__ encoding, so we elected for base64.)

Sorry for beating this point for so long, but it feels confusing to refer to the legacy Prometheus character set here when there is no technical reason for it.

Copy link
Member

@beorn7 beorn7 Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that my assumptions above are all correct, I would use the following flow of explanation:

  • Don't mention URL encoding at all, because it is (IMHO) implied that that always works (it's part of the HTTP standard, and tools like curl and browsers will use it implicitly anyway).
  • Just state that label names starting with U__ are interpreted as encoded according to the Values Escaping method, which can be used at will, but also is the only way to specify label names that contain a /.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have rewritten it this way.


For label names that contain characters that are not valid in the legacy
Prometheus character set (letters, numbers, underscores, and colons), they
should be escaped using the Values Escaping method:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's handy to have a short description here, do we have a comprehensive explanation of this encoding method that we can link to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the prometheus proposal is the closest thing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beorn7 I can file an issue for the doc issue, is that enough to unblock this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking to the Prometheus proposal is fine. I.e. link to https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md#text-escaping (as we did for PGW, just checked that out). No issue required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jan--f
Copy link
Contributor

jan--f commented Nov 25, 2024

Like beorn7 mentioned let's target release-3.0 for a bug fix. Apologies if I added confusion and extra work with my previous statement to target main.

Previously, the api was evaluating this regex to determine if the label
name was valid or not:

https://github.com/prometheus/common/blob/14bac55a992f7b83ab9d147a041e274606bdb607/model/labels.go#L94

However, I believe that the `IsValid()` function is what ought to be
used in the brave new utf8 era.

**Before**

```
$ curl localhost:9090/api/v1/label/host.name/values
{"status":"error","errorType":"bad_data","error":"invalid label name: \"host.name\""}
```

**After**

```
$ curl localhost:9090/api/v1/label/host.name/values
{"status":"success","data":["localhost"]}
```

It's very likely that I'm missing something here or you were already
planning to do this at some point but I just encountered this issue and
figured I'd give it a go.

Signed-off-by: Owen Williams <owen.williams@grafana.com>
@ywwg ywwg changed the base branch from main to release-3.0 November 25, 2024 16:49
@ywwg
Copy link
Member Author

ywwg commented Nov 25, 2024

rebased onto release-3.0

@ywwg ywwg force-pushed the labels-utf8-fix branch 2 times, most recently from 25085ee to 07b9c49 Compare November 27, 2024 18:09
@ywwg ywwg requested a review from beorn7 November 27, 2024 18:09
@ywwg
Copy link
Member Author

ywwg commented Nov 27, 2024

This should not block 3.0.1, but I believe it's ready to go in (pending getting the documentation right!)

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually good to have this in v3.0.1 because it is technically a breaking change (legacy label names could have the form U__...). I don't think it's relevant enough to call it out in the migration guide, though.

Copy link
Contributor

@jan--f jan--f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a follow up, but shouldn't block this fix.

Signed-off-by: Owen Williams <owen.williams@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-3.0 Fixes that need to be backported to release 3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants