prymitive
Contributor

This is a cleaned-up version of #15988, implemented behind a new build tag, toplabels.

prymitive added 4 commits May 12, 2025 09:13
Since labels_stringlabels.go is now the default implementation we should rename label files to make this more obvious.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
These functions are specific to the default labels implementation (formerly known as stringlabels).

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
stringlabels stores all time series labels as a single string using this format:

`<length><name><length><value>[<length><name><length><value> ...]`

So a label set for my_metric{job="foo", instance="bar", env="prod", blank=""} would be encoded as:

`[8]__name__[9]my_metric[3]job[3]foo[8]instance[3]bar[3]env[4]prod[5]blank[0]`
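A minimal sketch of this layout, for illustration only. The `Label` type and `encodePlain` function are hypothetical names, and the sketch assumes every string fits in a single length byte, which may not match how the real implementation encodes lengths.

```go
package main

import (
	"fmt"
	"strings"
)

// Label is a hypothetical name/value pair used only in this sketch.
type Label struct {
	Name, Value string
}

// encodePlain writes each name and value as one length byte followed by the
// raw string bytes. Assumption: every string is shorter than 256 bytes.
func encodePlain(ls []Label) string {
	var b strings.Builder
	for _, l := range ls {
		for _, s := range []string{l.Name, l.Value} {
			b.WriteByte(byte(len(s)))
			b.WriteString(s)
		}
	}
	return b.String()
}

func main() {
	ls := []Label{
		{"__name__", "my_metric"},
		{"job", "foo"},
		{"instance", "bar"},
		{"env", "prod"},
		{"blank", ""},
	}
	fmt.Println(len(encodePlain(ls))) // prints 56
}
```

Every string costs one length byte plus its own bytes, so the ten strings in the example add up to 56 bytes.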

This is a huge improvement over the 'classic' labels implementation, which stores every label name and value as a separate string. There is still room for improvement, though, since some strings appear far more often than others. For example, `__name__` is present in the label set of every time series we store in HEAD, eating 1+8=9 bytes each time. Since `__name__` is a well-known string, we can replace it in the encoded string with a single byte instead of repeating it in full each time. To store shortened strings this way we need to signal the substitution to the reader of the encoded string. For that we use the fact that zero-length strings are rare and generally not stored on time series. A zero length in the encoded string now signals a mapped value: to learn the true value of the string we read the next byte, which gives us an index into a static mapping. That mapping must include the empty string, so that we can still encode empty strings using this scheme.

Example of our mapping (minimal version):

```
0: ""
1: "__name__"
2: "instance"
3: "job"
```

With that mapping our example label set would be encoded as:

`[0]1[9]my_metric[0]3[3]foo[0]2[3]bar[3]env[4]prod[5]blank[0]0`

That takes 41 bytes instead of 56: the three mapped names save 16 bytes, while the empty value now costs 2 bytes instead of 1.
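The mapped scheme can be sketched as an encode/decode pair. This is an illustration under stated assumptions, not the code from this PR: `mapping`, `indexOf`, `encodeMapped` and `decodeMapped` are hypothetical names, lengths are assumed to fit in one byte, and the decoder does no bounds checking on malformed input.

```go
package main

import (
	"fmt"
	"strings"
)

// mapping is the minimal static table from the example; index 0 must be the
// empty string so that empty values stay representable.
var mapping = []string{"", "__name__", "instance", "job"}

// indexOf returns the table index of s, or -1 when s is not mapped.
func indexOf(s string) int {
	for i, m := range mapping {
		if m == s {
			return i
		}
	}
	return -1
}

// encodeMapped writes a mapped string as a zero length byte followed by its
// table index, and any other string as a length byte followed by its bytes.
// Assumptions: strings shorter than 256 bytes, at most 256 table entries.
func encodeMapped(strs ...string) string {
	var b strings.Builder
	for _, s := range strs {
		if i := indexOf(s); i >= 0 {
			b.WriteByte(0)
			b.WriteByte(byte(i))
			continue
		}
		b.WriteByte(byte(len(s)))
		b.WriteString(s)
	}
	return b.String()
}

// decodeMapped reverses encodeMapped; it trusts its input to be well formed.
func decodeMapped(data string) []string {
	var out []string
	for i := 0; i < len(data); {
		n := int(data[i])
		i++
		if n == 0 {
			// Zero length: the next byte is an index into the mapping.
			out = append(out, mapping[data[i]])
			i++
			continue
		}
		out = append(out, data[i:i+n])
		i += n
	}
	return out
}

func main() {
	enc := encodeMapped("__name__", "my_metric", "job", "foo",
		"instance", "bar", "env", "prod", "blank", "")
	fmt.Println(len(enc), decodeMapped(enc))
}
```

Because the empty string sits at index 0 of the table, a real empty value is always written as a zero length followed by index 0, so the decoder never has to guess whether a zero length means "empty" or "mapped".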

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
This populates, on startup, the static mapping of strings that are stored as a single byte.
We use the last TSDB block as the source of data: we iterate the index for each label and count how many time series reference each label pair.
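The selection step could look roughly like the sketch below. It is not the code from this PR: `pair` and `buildMapping` are hypothetical names, the per-pair series counts are assumed to have been collected from the block index already, and ranking strings purely by how many series they appear in is an assumption about the selection criterion.

```go
package main

import (
	"fmt"
	"sort"
)

// pair is a hypothetical label name/value pair.
type pair struct{ name, value string }

// buildMapping turns per-pair series counts into a table of at most limit
// strings. Index 0 is always the empty string; the remaining slots go to the
// strings referenced by the most series (ties broken alphabetically).
func buildMapping(seriesPerPair map[pair]int, limit int) []string {
	if limit < 1 {
		return nil
	}
	counts := map[string]int{}
	for p, n := range seriesPerPair {
		counts[p.name] += n
		counts[p.value] += n
	}
	delete(counts, "") // the empty string is added explicitly at index 0

	strs := make([]string, 0, len(counts))
	for s := range counts {
		strs = append(strs, s)
	}
	sort.Slice(strs, func(i, j int) bool {
		if counts[strs[i]] != counts[strs[j]] {
			return counts[strs[i]] > counts[strs[j]]
		}
		return strs[i] < strs[j]
	})
	if len(strs) > limit-1 {
		strs = strs[:limit-1]
	}
	return append([]string{""}, strs...)
}

func main() {
	seriesPerPair := map[pair]int{
		{"__name__", "a"}: 10,
		{"__name__", "b"}: 5,
		{"job", "x"}:      15,
	}
	fmt.Println(buildMapping(seriesPerPair, 4))
}
```

With a single index byte the table can hold at most 256 entries, which is why the sketch takes an explicit `limit`.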

We need to call mapCommonLabelSymbols() once TSDB has opened all blocks, but before we start to replay the WAL and populate the HEAD.
There doesn't seem to be a way to do this right now, so add a hook we can use for it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>