
Conversation

bboreham
Member

@bboreham bboreham commented Apr 30, 2023

This builds on #10991, implementing "Option 5" in the design doc.

The code adds a new labels.Labels implementation which can be selected at build time with -tags dedupelabels.

Previous attempts to do string interning in Prometheus have used a single lookup table for the entire program; this PR allows different sets of labels to have their own tables. Example: where you fetch series via remote-read, or receive series via remote-write with a symbol-table (see #11999).

Most of the code changes outside of the model/labels package are to manage the symbol tables.

Managing concurrent access to symbol tables:

  • When creating new labels, a sync.Mutex is locked around reads and writes.
  • When reading labels, there is no locking, because the table of strings used by those labels is immutable (see the sketch below).
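
Below is a minimal sketch of that locking scheme, with hypothetical names and a simplified layout rather than the PR's actual code: writers intern strings under a mutex, while readers resolve symbols through a snapshot captured at build time, which is never mutated, so they need no lock.

```go
package main

import (
	"fmt"
	"sync"
)

// SymbolTable interns strings; byNum only ever grows and existing entries are
// never rewritten, which is what makes lock-free reads safe.
type SymbolTable struct {
	mx     sync.Mutex
	byName map[string]int
	byNum  []string
}

func NewSymbolTable() *SymbolTable {
	return &SymbolTable{byName: map[string]int{}}
}

// Labels holds symbol numbers plus the snapshot of the string table that was
// current when the labels were built.
type Labels struct {
	syms []string
	nums []int // alternating name/value symbol numbers
}

// MakeLabels interns the given strings; writers serialise on the mutex.
func (t *SymbolTable) MakeLabels(pairs ...string) Labels {
	t.mx.Lock()
	defer t.mx.Unlock()
	nums := make([]int, len(pairs))
	for i, s := range pairs {
		n, ok := t.byName[s]
		if !ok {
			n = len(t.byNum)
			t.byNum = append(t.byNum, s)
			t.byName[s] = n
		}
		nums[i] = n
	}
	// Capture the slice header under the lock: every symbol these labels
	// reference already exists in it, and it is never mutated in place.
	return Labels{syms: t.byNum, nums: nums}
}

// Get resolves the i-th string with no locking at all.
func (ls Labels) Get(i int) string { return ls.syms[ls.nums[i]] }

func main() {
	st := NewSymbolTable()
	ls := st.MakeLabels("job", "node", "instance", "host:9100")
	fmt.Println(ls.Get(0), "=", ls.Get(1)) // job = node
}
```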

This branch includes two other PRs, #13452 and #13491. Update: both are now merged.

@colega
Contributor

colega commented May 5, 2023

Can you explain in a few words how this deals with the churn of series in the Head?

Member Author

@bboreham bboreham left a comment

Thanks for taking a look!

Can you explain in a few words how this deals with the churn of series in the Head?

Not implemented yet. Suggestion: after head compaction and series GC, create a new SymbolTable, then run through all live series and rebuild their Labels.
Then reset the SymbolTables used by scraping parsers.
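
A rough sketch of that flow, assuming the SymbolTable and ScratchBuilder helpers this PR adds (the series type and the function here are hypothetical stand-ins, not the eventual implementation):

```go
package sketch

import "github.com/prometheus/prometheus/model/labels"

// series is a hypothetical stand-in for tsdb's memSeries.
type series struct {
	lset labels.Labels
}

// rebuildSymbols moves all live series onto a fresh SymbolTable after head
// compaction and series GC, so symbols referenced only by deleted series can
// be garbage-collected. Callers would then point the scrape parsers at the
// returned table as well.
func rebuildSymbols(live []*series) *labels.SymbolTable {
	st := labels.NewSymbolTable()
	b := labels.NewScratchBuilderWithSymbolTable(st, 16)
	for _, s := range live {
		b.Reset()
		s.lset.Range(func(l labels.Label) {
			b.Add(l.Name, l.Value)
		})
		s.lset = b.Labels()
	}
	return st
}
```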

var aName, bName string
aName, ia = decodeString(a.data, ia)
bName, ib = decodeString(b.data, ib)
aName, ia = decodeString(a.syms, a.data, ia)
Contributor

If the two symbol tables are equal, would it be possible to avoid decoding label symbols and compare them directly?

Member Author

For Equal, yes, but not Compare since we don’t have an ordering in the symbol table.
Might be worth skipping all the strings that are equal and just decoding the first one that’s different.
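
As a toy illustration of that fast path (the field names and encoding are invented for the example, not the PR's layout): when two label sets share a symbol table, identical encoded bytes must decode to identical strings, so Equal can skip decoding entirely, while Compare still has to decode because symbol numbers carry no ordering.

```go
package main

import "fmt"

type symTab struct{ byNum []string }

// toyLabels keeps a pointer to its symbol table plus encoded symbol numbers.
type toyLabels struct {
	syms *symTab
	data string // one byte per symbol number, for simplicity
}

func equal(a, b toyLabels) bool {
	if a.syms == b.syms {
		// Same table: equal encodings imply equal strings, no decoding needed.
		return a.data == b.data
	}
	// Different tables: fall back to decoding and comparing the strings.
	if len(a.data) != len(b.data) {
		return false
	}
	for i := 0; i < len(a.data); i++ {
		if a.syms.byNum[a.data[i]] != b.syms.byNum[b.data[i]] {
			return false
		}
	}
	return true
}

func main() {
	t := &symTab{byNum: []string{"job", "node"}}
	a := toyLabels{syms: t, data: string([]byte{0, 1})}
	b := toyLabels{syms: t, data: string([]byte{0, 1})}
	fmt.Println(equal(a, b)) // true, decided without decoding any strings
}
```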

Contributor

That's a good point, decoding only what is different would probably speed things up when doing a merge of two sorted label sets. For example, when merging data from two TSDBs.

Member Author

Similar idea for current main: #12451

@bboreham
Member Author

I have rebased the code, and made some important updates since the last review:

  • Prometheus can be built with either -tags stringlabels or -tags internlabels for the version in this PR.
  • The symbol-table used for scraping is cleared out periodically, after a target reload; see the sketch below. (TODO: clear it out if targets are never reloaded.)
  • Symbol table locking was removed on the read path.
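
Here is a hedged sketch of that periodic reset, with hypothetical type and field names (the real reload path is more involved):

```go
package sketch

import (
	"time"

	"github.com/prometheus/prometheus/model/labels"
)

// scrapePoolSketch is a hypothetical stand-in for the scrape pool.
type scrapePoolSketch struct {
	symbolTable     *labels.SymbolTable
	lastSymTabReset time.Time
}

// onTargetsReloaded replaces the pool's SymbolTable with a fresh one, at most
// every resetInterval, so strings that only belonged to departed targets stop
// being pinned. New scrapes intern into the fresh table; labels built against
// the old table keep it alive until they are rebuilt or dropped.
func (sp *scrapePoolSketch) onTargetsReloaded() {
	const resetInterval = 10 * time.Minute // illustrative threshold
	if time.Since(sp.lastSymTabReset) < resetInterval {
		return
	}
	sp.symbolTable = labels.NewSymbolTable()
	sp.lastSymTabReset = time.Now()
}
```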


@bboreham
Member Author

Memory usage is ~20% down, which is great; CPU usage and query latency are up a bit, which I shall look into.

Comment on lines 92 to 104
// ToName maps an integer to a string.
func (t *nameTable) ToName(num int) string {
Contributor

I feel like this should be ToValue, but that's just an opinion.

Member Author

Given we have explicit names and values for labels, I don't think I should prefer one here.
I can't remember why it's a "name table" and not a "string table".

Comment on lines 360 to 374
// Idea: have a constant symbol for blank, then we don't have to look it up.
blank, ok := ls.syms.symbolTable.byName[""]
if !ok { // Symbol table has no entry for blank - none of the values can be blank.
return ls
Contributor

@colega colega Dec 1, 2023

I think the "Idea:" would perform worse: if we don't look it up, we always have to check whether the constant symbol is present in the data. I like this hack.

Edit: maybe if the symbol table is too broad, we always have the empty symbol...
Edit 2: actually, with a constant symbol like 0x00 we could optimize the first-pass lookup by checking whether there's a 0-byte anywhere in the data slice (we would still need to decode varints if one is present, unless we avoid using any 0x00-ending positions in the symbol table)

Suggestion from @colega.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Suggestion from @colega.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
@bboreham bboreham merged commit d2817e9 into prometheus:main Feb 26, 2024
@prymitive
Contributor

I've tested v2.51.0 with the dedupelabels tag on one server (green) and compared it with another server running v2.51.0 with the stringlabels tag.
Initially it all looked great and we traded some CPU for lower memory usage, but after a week it seems that the memory usage is slowly catching up with the stringlabels build.

[screenshot: memory usage of the two servers, dedupelabels vs stringlabels]

File: prometheus
Type: inuse_space
Time: Mar 28, 2024 at 9:16am (GMT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 35.37GB, 65.15% of 54.29GB total
Dropped 651 nodes (cum <= 0.27GB)
Showing top 10 nodes out of 113
      flat  flat%   sum%        cum   cum%
    6.71GB 12.36% 12.36%     6.71GB 12.36%  github.com/prometheus/prometheus/model/labels.(*SymbolTable).toNumUnlocked
    5.52GB 10.16% 22.52%     6.47GB 11.91%  github.com/prometheus/prometheus/tsdb.newMemSeries
    5.29GB  9.74% 32.26%     5.29GB  9.74%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addRef
    3.72GB  6.86% 39.12%     3.72GB  6.86%  github.com/prometheus/prometheus/scrape.(*scrapeCache).trackStaleness
    3.51GB  6.46% 45.57%     3.51GB  6.46%  github.com/prometheus/prometheus/tsdb/chunkenc.NewXORChunk
    3.06GB  5.63% 51.20%     3.06GB  5.63%  github.com/prometheus/prometheus/tsdb/chunkenc.(*bstream).writeByte
    2.36GB  4.35% 55.55%     2.36GB  4.35%  github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete
    1.84GB  3.39% 58.94%     1.84GB  3.39%  github.com/prometheus/prometheus/tsdb.(*CircularExemplarStorage).AddExemplar
    1.78GB  3.27% 62.21%     1.78GB  3.27%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addDropped
    1.60GB  2.94% 65.15%     1.60GB  2.94%  github.com/prometheus/prometheus/tsdb.(*txRing).add

@bboreham
Member Author

Thanks for looking at this. I really need a bit more context for the profile, to see where those functions were called. Could you upload the raw data?

@prymitive
Contributor

prymitive commented Mar 28, 2024

Is an svg ok?

It's here:
[pprof SVG attached]

@bboreham
Member Author

Svg is better but I still can't see the line numbers. However in this case I think it's enough; the main growth is under the scrape loop.

@prymitive
Contributor

Svg is better but I still can't see the line numbers. However in this case I think it's enough; the main growth is under the scrape loop.

I suspect it's the scrapeCache as it keeps references to labels.

ROUTINE ======================== github.com/prometheus/prometheus/scrape.(*scrapeCache).addDropped in scrape/scrape.go
    1.78GB     1.78GB (flat, cum)  2.94% of Total
         .          .    974:func (c *scrapeCache) addDropped(met []byte) {
      93MB       93MB    975:	iter := c.iter
    1.69GB     1.69GB    976:	c.droppedSeries[string(met)] = &iter
         .          .    977:}
         .          .    978:
         .          .    979:func (c *scrapeCache) getDropped(met []byte) bool {
         .          .    980:	iterp, ok := c.droppedSeries[string(met)]
         .          .    981:	if ok {
ROUTINE ======================== github.com/prometheus/prometheus/scrape.(*scrapeCache).addRef in scrape/scrape.go
    5.31GB     5.31GB (flat, cum)  8.78% of Total
         .          .    967:func (c *scrapeCache) addRef(met []byte, ref storage.SeriesRef, lset labels.Labels, hash uint64) {
         .          .    968:	if ref == 0 {
         .          .    969:		return
         .          .    970:	}
    5.31GB     5.31GB    971:	c.series[string(met)] = &cacheEntry{ref: ref, lastIter: c.iter, lset: lset, hash: hash}
         .          .    972:}
         .          .    973:
         .          .    974:func (c *scrapeCache) addDropped(met []byte) {
         .          .    975:	iter := c.iter
         .          .    976:	c.droppedSeries[string(met)] = &iter
ROUTINE ======================== github.com/prometheus/prometheus/scrape.(*scrapeCache).trackStaleness in scrape/scrape.go
    3.75GB     3.75GB (flat, cum)  6.20% of Total
         .          .    987:func (c *scrapeCache) trackStaleness(hash uint64, lset labels.Labels) {
    3.75GB     3.75GB    988:	c.seriesCur[hash] = lset
         .          .    989:}
         .          .    990:
         .          .    991:func (c *scrapeCache) forEachStale(f func(labels.Labels) bool) {
         .          .    992:	for h, lset := range c.seriesPrev {
         .          .    993:		if _, ok := c.seriesCur[h]; !ok {

Could it be that SymbolTable instances are being kept referenced there for too long (forever?)?

@bboreham
Member Author

bboreham commented Apr 2, 2024

Yes, that is a good point: scrapeCache → cacheEntry → lset will hold SymbolTables alive.

I could flush the cache at the point it decides to discard the old SymbolTable.
https://github.com/prometheus/prometheus/blob/9e2c335bab/scrape/scrape.go#L359
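
A simplified sketch of that fix, with hypothetical types (the point being that the cache stores parsed labels.Labels, which is exactly the reference that pins the old table):

```go
package sketch

import "github.com/prometheus/prometheus/model/labels"

// cacheSketch stands in for scrapeCache: it maps scraped metric text to the
// parsed labels, and those labels reference a SymbolTable.
type cacheSketch struct {
	series map[string]labels.Labels
}

type poolSketch struct {
	symbolTable *labels.SymbolTable
	caches      []*cacheSketch
}

// discardOldSymbolTable swaps in a fresh SymbolTable and flushes the scrape
// caches at the same time, so cached labels built against the old table cannot
// keep it (and every string it interned) alive forever. Entries are re-created
// from the next scrape.
func (p *poolSketch) discardOldSymbolTable() {
	p.symbolTable = labels.NewSymbolTable()
	for _, c := range p.caches {
		c.series = map[string]labels.Labels{}
	}
}
```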

@prymitive
Contributor

I wonder if the unique package added in the next Go release would help here - https://pkg.go.dev/unique@master

@bboreham
Member Author

That one gets you to 8 bytes per string handle, or 16 bytes per label (name plus value).
This PR is more like 4.5 bytes per label.

However, it might be preferable considering the amount of work needed to get this one to behave.
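
For comparison, here is roughly what label interning with Go's unique package (Go 1.23+) could look like; the types are illustrative, not a Prometheus API. Each name and value becomes an 8-byte unique.Handle, so 16 bytes per label, and equal labels compare as two word comparisons.

```go
package main

import (
	"fmt"
	"unique"
)

// internedLabel stores canonical handles instead of the strings themselves;
// duplicate strings across many labels share one backing copy.
type internedLabel struct {
	name, value unique.Handle[string]
}

func makeLabel(name, value string) internedLabel {
	return internedLabel{name: unique.Make(name), value: unique.Make(value)}
}

func main() {
	a := makeLabel("job", "node")
	b := makeLabel("job", "node")
	fmt.Println(a == b)                          // true: handles of equal strings are equal
	fmt.Println(a.name.Value(), a.value.Value()) // "job node"
}
```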

@bboreham
Member Author

Small update:
All the fixes for the continuous memory growth have landed, and I plan to push a 2.54.0+dedupelabels image next week.
According to PromBench at #14462 the memory saving is now only around 10%, which I think is because memSeries got bigger.
I am experimenting with a couple of ideas to bring that back.

(If you have a lot of long and non-unique labels then your saving will look better)

@bboreham bboreham deleted the labels-symboltable branch July 12, 2024 11:16
@bboreham
Member Author

The new Docker image is prom/prometheus:v2.54.0-dedupelabels - did anyone else try it?

@SuperQ
Member

SuperQ commented Aug 20, 2024

Not yet, I will give it a try soon

@prymitive
Contributor

The new Docker image is prom/prometheus:v2.54.0-dedupelabels - did anyone else try it?

I have a pair of identical servers running 2.54.0 - one with stringlabels and the other with dedupelabels. I think I need to give it a week or two before I can tell anything conclusive, since last time I tried it, it was slowly using more memory over time - #12304 (comment)

[screenshot: memory usage of the two 2.54.0 servers]

So far the heap profile looks good, without SymbolTable getting to the top:

File: prometheus
Type: inuse_space
Time: Aug 23, 2024 at 9:12am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 30.53GB, 61.02% of 50.02GB total
Dropped 742 nodes (cum <= 0.25GB)
Showing top 10 nodes out of 127
      flat  flat%   sum%        cum   cum%
    5.23GB 10.46% 10.46%     5.23GB 10.46%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addRef
    4.99GB  9.97% 20.43%     5.91GB 11.82%  github.com/prometheus/prometheus/tsdb.newMemSeries
    3.80GB  7.60% 28.03%     3.80GB  7.60%  github.com/prometheus/prometheus/scrape.(*scrapeCache).trackStaleness
    3.52GB  7.04% 35.07%     3.52GB  7.04%  github.com/prometheus/prometheus/tsdb/chunkenc.NewXORChunk
    2.73GB  5.46% 40.53%     2.73GB  5.46%  github.com/prometheus/prometheus/tsdb/chunkenc.(*bstream).writeByte
    2.58GB  5.17% 45.69%     2.58GB  5.17%  github.com/prometheus/prometheus/model/labels.(*SymbolTable).toNumUnlocked
    2.14GB  4.27% 49.96%     2.14GB  4.27%  github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete.func1
    1.90GB  3.79% 53.75%     1.90GB  3.79%  github.com/prometheus/prometheus/tsdb.NewCircularExemplarStorage
    1.83GB  3.67% 57.42%     4.41GB  8.82%  github.com/prometheus/prometheus/model/labels.(*ScratchBuilder).Labels
    1.80GB  3.60% 61.02%     1.80GB  3.60%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addDropped

@bboreham
Member Author

Hi, I hope to give a lightning update at PromCon today - please post any statistics from people trying it out.

@prymitive
Contributor

Hi, I hope to give a lightning update at PromCon today - please post any statistics from people trying it out.

I'm on leave for most of September so I didn't pay much attention to these instances, but here's a screenshot of a comparison of our two instances - one vanilla (stringlabels) and the other dedupelabels.
It covers the last 30 days and you can clearly see the moment when one instance was converted to dedupelabels (410dm7). Up to that point both had pretty much the same usage pattern; after that, dedupelabels uses less memory but ~10% more CPU. Memory usage of dedupelabels is a lot less flat - which isn't unexpected given the need to occasionally swap symbol tables etc.

[screenshot: Grafana "Prometheus performance comparison" dashboard, 2024-09-12]

@prymitive
Contributor

Here's the latest screenshot. Although there is a reduction in memory usage, dedupelabels is a lot less predictable about how much memory it will use, compared to the pretty much flat memory usage of stringlabels - which is not unexpected given how dedupelabels works.
Because of that I don't think I'd be very eager to swap stringlabels for dedupelabels on our servers: the memory savings are great and we can spare the extra ~10% CPU to gain them, but I value the stable resource-usage pattern of stringlabels much more. Plus there would always be a worry that some service hits the worst-case usage pattern for dedupelabels, causing occasional OOM kills.

[screenshot: Grafana "Prometheus performance comparison" dashboard, 2024-10-02]

@bboreham
Member Author

If I understand your charts correctly, there is a point around 09/16 where dedupelabels memory goes ~5GB above stringlabels. This has to be a bug somewhere; nothing about the design would lead to that result.

A profile at the time where memory goes higher would be useful.
