
Conversation

bboreham
Member

@bboreham bboreham commented Apr 30, 2023

This builds on #10991, implementing "Option 5" in the design doc.

The code adds a new labels.Labels implementation which can be selected at build time with -tags dedupelabels.

Previous attempts to do string interning in Prometheus have used a single lookup table for the entire program; this PR allows different sets of labels to have their own tables. Example: where you fetch series via remote-read, or receive series via remote-write with a symbol-table (see #11999).

Most of the code changes outside of the model/labels package are to manage the symbol tables.

Managing concurrent access to symbol tables:

  • When creating new labels, a sync.Mutex is locked around reads and writes.
  • When reading labels, there is no locking, because the table of strings used by those labels is immutable (see the sketch below).
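
Below is a minimal sketch of that locking scheme, with hypothetical names and a simplified layout rather than the PR's actual code: writers intern strings under a mutex, while readers resolve symbols through a snapshot captured at build time, which is never mutated, so they need no lock.

```go
package main

import (
	"fmt"
	"sync"
)

// SymbolTable interns strings; byNum only ever grows and existing entries are
// never rewritten, which is what makes lock-free reads safe.
type SymbolTable struct {
	mx     sync.Mutex
	byName map[string]int
	byNum  []string
}

func NewSymbolTable() *SymbolTable {
	return &SymbolTable{byName: map[string]int{}}
}

// Labels holds symbol numbers plus the snapshot of the string table that was
// current when the labels were built.
type Labels struct {
	syms []string
	nums []int // alternating name/value symbol numbers
}

// MakeLabels interns the given strings; writers serialise on the mutex.
func (t *SymbolTable) MakeLabels(pairs ...string) Labels {
	t.mx.Lock()
	defer t.mx.Unlock()
	nums := make([]int, len(pairs))
	for i, s := range pairs {
		n, ok := t.byName[s]
		if !ok {
			n = len(t.byNum)
			t.byNum = append(t.byNum, s)
			t.byName[s] = n
		}
		nums[i] = n
	}
	// Capture the slice header under the lock: every symbol these labels
	// reference already exists in it, and it is never mutated in place.
	return Labels{syms: t.byNum, nums: nums}
}

// Get resolves the i-th string with no locking at all.
func (ls Labels) Get(i int) string { return ls.syms[ls.nums[i]] }

func main() {
	st := NewSymbolTable()
	ls := st.MakeLabels("job", "node", "instance", "host:9100")
	fmt.Println(ls.Get(0), "=", ls.Get(1)) // job = node
}
```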

This branch includes two other PRs, #13452 and #13491. Update: both are now merged.

@colega
Contributor

colega commented May 5, 2023

Can you explain in a few words how this deals with the churn of series in the Head?

Member Author

@bboreham bboreham left a comment

Thanks for taking a look!

Can you explain in a few words how this deals with the churn of series in the Head?

Not implemented yet. Suggestion: after head compaction and series GC, create a new SymbolTable, then run through all live series and rebuild their Labels.
Then reset the SymbolTables used by scraping parsers.
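
A rough sketch of that flow, assuming the SymbolTable and ScratchBuilder helpers this PR adds (the series type and the function here are hypothetical stand-ins, not the eventual implementation):

```go
package sketch

import "github.com/prometheus/prometheus/model/labels"

// series is a hypothetical stand-in for tsdb's memSeries.
type series struct {
	lset labels.Labels
}

// rebuildSymbols moves all live series onto a fresh SymbolTable after head
// compaction and series GC, so symbols referenced only by deleted series can
// be garbage-collected. Callers would then point the scrape parsers at the
// returned table as well.
func rebuildSymbols(live []*series) *labels.SymbolTable {
	st := labels.NewSymbolTable()
	b := labels.NewScratchBuilderWithSymbolTable(st, 16)
	for _, s := range live {
		b.Reset()
		s.lset.Range(func(l labels.Label) {
			b.Add(l.Name, l.Value)
		})
		s.lset = b.Labels()
	}
	return st
}
```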

var aName, bName string
aName, ia = decodeString(a.data, ia)
bName, ib = decodeString(b.data, ib)
aName, ia = decodeString(a.syms, a.data, ia)
Contributor

If the two symbol tables are equal, would it be possible to avoid decoding label symbols and compare them directly?

Member Author

For Equal, yes, but not Compare since we don’t have an ordering in the symbol table.
Might be worth skipping all the strings that are equal and just decoding the first one that’s different.
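
As a toy illustration of that fast path (the field names and encoding are invented for the example, not the PR's layout): when two label sets share a symbol table, identical encoded bytes must decode to identical strings, so Equal can skip decoding entirely, while Compare still has to decode because symbol numbers carry no ordering.

```go
package main

import "fmt"

type symTab struct{ byNum []string }

// toyLabels keeps a pointer to its symbol table plus encoded symbol numbers.
type toyLabels struct {
	syms *symTab
	data string // one byte per symbol number, for simplicity
}

func equal(a, b toyLabels) bool {
	if a.syms == b.syms {
		// Same table: equal encodings imply equal strings, no decoding needed.
		return a.data == b.data
	}
	// Different tables: fall back to decoding and comparing the strings.
	if len(a.data) != len(b.data) {
		return false
	}
	for i := 0; i < len(a.data); i++ {
		if a.syms.byNum[a.data[i]] != b.syms.byNum[b.data[i]] {
			return false
		}
	}
	return true
}

func main() {
	t := &symTab{byNum: []string{"job", "node"}}
	a := toyLabels{syms: t, data: string([]byte{0, 1})}
	b := toyLabels{syms: t, data: string([]byte{0, 1})}
	fmt.Println(equal(a, b)) // true, decided without decoding any strings
}
```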

Contributor

That's a good point, decoding only what is different would probably speed things up when doing a merge of two sorted label sets. For example, when merging data from two TSDBs.

Member Author

Similar idea for current main: #12451

@bboreham
Member Author

I have rebased the code, and made some important updates since the last review:

  • Prometheus can be built with either -tags stringlabels or -tags internlabels for the version in this PR.
  • The symbol-table used for scraping is cleared out periodically, after a target reload; see the sketch below. (TODO: clear it out if targets are never reloaded.)
  • Symbol table locking was removed on the read path.
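
Here is a hedged sketch of that periodic reset, with hypothetical type and field names (the real reload path is more involved):

```go
package sketch

import (
	"time"

	"github.com/prometheus/prometheus/model/labels"
)

// scrapePoolSketch is a hypothetical stand-in for the scrape pool.
type scrapePoolSketch struct {
	symbolTable     *labels.SymbolTable
	lastSymTabReset time.Time
}

// onTargetsReloaded replaces the pool's SymbolTable with a fresh one, at most
// every resetInterval, so strings that only belonged to departed targets stop
// being pinned. New scrapes intern into the fresh table; labels built against
// the old table keep it alive until they are rebuilt or dropped.
func (sp *scrapePoolSketch) onTargetsReloaded() {
	const resetInterval = 10 * time.Minute // illustrative threshold
	if time.Since(sp.lastSymTabReset) < resetInterval {
		return
	}
	sp.symbolTable = labels.NewSymbolTable()
	sp.lastSymTabReset = time.Now()
}
```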


@bboreham
Member Author

Memory usage is ~20% down, which is great; CPU usage and query latency are up a bit, which I shall look into.

Comment on lines 92 to 104
// ToName maps an integer to a string.
func (t *nameTable) ToName(num int) string {
Contributor

I feel like this should be ToValue, but that's just an opinion.

Member Author

Given we have explicit names and values for labels, I don't think I should prefer one here.
I can't remember why it's a "name table" and not a "string table".

Comment on lines 360 to 374
// Idea: have a constant symbol for blank, then we don't have to look it up.
blank, ok := ls.syms.symbolTable.byName[""]
if !ok { // Symbol table has no entry for blank - none of the values can be blank.
return ls
Contributor

@colega colega Dec 1, 2023

I think the "Idea:" would perform worse: if we don't look it up, we always have to check whether the constant symbol is present in the data. I like this hack.

Edit: maybe if the symbol table is too broad, we always have the empty symbol...
Edit 2: actually, with a constant symbol like 0x00 we could optimize the first-pass lookup by checking whether there's a 0-byte anywhere in the data slice (we would still need to decode varints if one is present, unless we avoid using any 0x00-ending positions in the symbol table)

Suggestion from @colega.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Suggestion from @colega.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
@bboreham bboreham merged commit d2817e9 into prometheus:main Feb 26, 2024
@prymitive
Contributor

I've tested v2.51.0 with the dedupelabels tag on one server (green) and compared it with another server running v2.51.0 with the stringlabels tag.
Initially it all looked great and we traded some CPU for lower memory usage, but after a week it seems that the memory usage is slowly catching up with the stringlabels build.

[screenshot: memory usage of the two servers, dedupelabels vs stringlabels]

File: prometheus
Type: inuse_space
Time: Mar 28, 2024 at 9:16am (GMT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 35.37GB, 65.15% of 54.29GB total
Dropped 651 nodes (cum <= 0.27GB)
Showing top 10 nodes out of 113
      flat  flat%   sum%        cum   cum%
    6.71GB 12.36% 12.36%     6.71GB 12.36%  github.com/prometheus/prometheus/model/labels.(*SymbolTable).toNumUnlocked
    5.52GB 10.16% 22.52%     6.47GB 11.91%  github.com/prometheus/prometheus/tsdb.newMemSeries
    5.29GB  9.74% 32.26%     5.29GB  9.74%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addRef
    3.72GB  6.86% 39.12%     3.72GB  6.86%  github.com/prometheus/prometheus/scrape.(*scrapeCache).trackStaleness
    3.51GB  6.46% 45.57%     3.51GB  6.46%  github.com/prometheus/prometheus/tsdb/chunkenc.NewXORChunk
    3.06GB  5.63% 51.20%     3.06GB  5.63%  github.com/prometheus/prometheus/tsdb/chunkenc.(*bstream).writeByte
    2.36GB  4.35% 55.55%     2.36GB  4.35%  github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete
    1.84GB  3.39% 58.94%     1.84GB  3.39%  github.com/prometheus/prometheus/tsdb.(*CircularExemplarStorage).AddExemplar
    1.78GB  3.27% 62.21%     1.78GB  3.27%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addDropped
    1.60GB  2.94% 65.15%     1.60GB  2.94%  github.com/prometheus/prometheus/tsdb.(*txRing).add

@bboreham
Member Author

Thanks for looking at this. I really need a bit more context for the profile, to see where those functions were called. Could you upload the raw data?

@prymitive
Contributor

prymitive commented Mar 28, 2024

Is an svg ok?

It's here:
[pprof SVG attached]

@bboreham
Member Author

Svg is better but I still can't see the line numbers. However in this case I think it's enough; the main growth is under the scrape loop.

@prymitive
Contributor

Svg is better but I still can't see the line numbers. However in this case I think it's enough; the main growth is under the scrape loop.

I suspect it's the scrapeCache as it keeps references to labels.

ROUTINE ======================== github.com/prometheus/prometheus/scrape.(*scrapeCache).addDropped in scrape/scrape.go
    1.78GB     1.78GB (flat, cum)  2.94% of Total
         .          .    974:func (c *scrapeCache) addDropped(met []byte) {
      93MB       93MB    975:	iter := c.iter
    1.69GB     1.69GB    976:	c.droppedSeries[string(met)] = &iter
         .          .    977:}
         .          .    978:
         .          .    979:func (c *scrapeCache) getDropped(met []byte) bool {
         .          .    980:	iterp, ok := c.droppedSeries[string(met)]
         .          .    981:	if ok {
ROUTINE ======================== github.com/prometheus/prometheus/scrape.(*scrapeCache).addRef in scrape/scrape.go
    5.31GB     5.31GB (flat, cum)  8.78% of Total
         .          .    967:func (c *scrapeCache) addRef(met []byte, ref storage.SeriesRef, lset labels.Labels, hash uint64) {
         .          .    968:	if ref == 0 {
         .          .    969:		return
         .          .    970:	}
    5.31GB     5.31GB    971:	c.series[string(met)] = &cacheEntry{ref: ref, lastIter: c.iter, lset: lset, hash: hash}
         .          .    972:}
         .          .    973:
         .          .    974:func (c *scrapeCache) addDropped(met []byte) {
         .          .    975:	iter := c.iter
         .          .    976:	c.droppedSeries[string(met)] = &iter
ROUTINE ======================== github.com/prometheus/prometheus/scrape.(*scrapeCache).trackStaleness in scrape/scrape.go
    3.75GB     3.75GB (flat, cum)  6.20% of Total
         .          .    987:func (c *scrapeCache) trackStaleness(hash uint64, lset labels.Labels) {
    3.75GB     3.75GB    988:	c.seriesCur[hash] = lset
         .          .    989:}
         .          .    990:
         .          .    991:func (c *scrapeCache) forEachStale(f func(labels.Labels) bool) {
         .          .    992:	for h, lset := range c.seriesPrev {
         .          .    993:		if _, ok := c.seriesCur[h]; !ok {

Could it be that SymbolTable instances are being kept referenced there for too long (forever?)?

@bboreham
Member Author

bboreham commented Apr 2, 2024

Yes, that is a good point: scrapeCache → cacheEntry → lset will hold SymbolTables alive.

I could flush the cache at the point it decides to discard the old SymbolTable.
https://github.com/prometheus/prometheus/blob/9e2c335bab/scrape/scrape.go#L359
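
A simplified sketch of that fix, with hypothetical types (the point being that the cache stores parsed labels.Labels, which is exactly the reference that pins the old table):

```go
package sketch

import "github.com/prometheus/prometheus/model/labels"

// cacheSketch stands in for scrapeCache: it maps scraped metric text to the
// parsed labels, and those labels reference a SymbolTable.
type cacheSketch struct {
	series map[string]labels.Labels
}

type poolSketch struct {
	symbolTable *labels.SymbolTable
	caches      []*cacheSketch
}

// discardOldSymbolTable swaps in a fresh SymbolTable and flushes the scrape
// caches at the same time, so cached labels built against the old table cannot
// keep it (and every string it interned) alive forever. Entries are re-created
// from the next scrape.
func (p *poolSketch) discardOldSymbolTable() {
	p.symbolTable = labels.NewSymbolTable()
	for _, c := range p.caches {
		c.series = map[string]labels.Labels{}
	}
}
```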

@prymitive
Contributor

I wonder if the unique package added in the next Go release would help here - https://pkg.go.dev/unique@master

@bboreham
Member Author

That one gets you to 8 bytes per string handle, or 16 bytes per label (name plus value).
This PR is more like 4.5 bytes per label.

However, it might be preferable considering the amount of work needed to get this one to behave.
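
For comparison, here is roughly what label interning with Go's unique package (Go 1.23+) could look like; the types are illustrative, not a Prometheus API. Each name and value becomes an 8-byte unique.Handle, so 16 bytes per label, and equal labels compare as two word comparisons.

```go
package main

import (
	"fmt"
	"unique"
)

// internedLabel stores canonical handles instead of the strings themselves;
// duplicate strings across many labels share one backing copy.
type internedLabel struct {
	name, value unique.Handle[string]
}

func makeLabel(name, value string) internedLabel {
	return internedLabel{name: unique.Make(name), value: unique.Make(value)}
}

func main() {
	a := makeLabel("job", "node")
	b := makeLabel("job", "node")
	fmt.Println(a == b)                          // true: handles of equal strings are equal
	fmt.Println(a.name.Value(), a.value.Value()) // "job node"
}
```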

@bboreham
Member Author

Small update:
All the fixes for the continuous memory growth have landed, and I plan to push a 2.54.0+dedupelabels image next week.
According to PromBench at #14462 the memory saving is now only around 10%, which I think is because memSeries got bigger.
I am experimenting with a couple of ideas to bring that back.

(If you have a lot of long and non-unique labels then your saving will look better)

@bboreham bboreham deleted the labels-symboltable branch July 12, 2024 11:16
@bboreham
Member Author

The new Docker image is prom/prometheus:v2.54.0-dedupelabels - did anyone else try it?

@SuperQ
Member

SuperQ commented Aug 20, 2024

Not yet, I will give it a try soon

@prymitive
Contributor

The new Docker image is prom/prometheus:v2.54.0-dedupelabels - did anyone else try it?

I have a pair of identical servers running 2.54.0 - one with stringlabels and the other with dedupelabels. I think I need to give it a week or two before I can tell anything conclusive, since last time I tried it, it was slowly using more memory over time - #12304 (comment)

[screenshot: memory usage of the two 2.54.0 servers]

So far the heap profile looks good, without SymbolTable getting to the top:

File: prometheus
Type: inuse_space
Time: Aug 23, 2024 at 9:12am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 30.53GB, 61.02% of 50.02GB total
Dropped 742 nodes (cum <= 0.25GB)
Showing top 10 nodes out of 127
      flat  flat%   sum%        cum   cum%
    5.23GB 10.46% 10.46%     5.23GB 10.46%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addRef
    4.99GB  9.97% 20.43%     5.91GB 11.82%  github.com/prometheus/prometheus/tsdb.newMemSeries
    3.80GB  7.60% 28.03%     3.80GB  7.60%  github.com/prometheus/prometheus/scrape.(*scrapeCache).trackStaleness
    3.52GB  7.04% 35.07%     3.52GB  7.04%  github.com/prometheus/prometheus/tsdb/chunkenc.NewXORChunk
    2.73GB  5.46% 40.53%     2.73GB  5.46%  github.com/prometheus/prometheus/tsdb/chunkenc.(*bstream).writeByte
    2.58GB  5.17% 45.69%     2.58GB  5.17%  github.com/prometheus/prometheus/model/labels.(*SymbolTable).toNumUnlocked
    2.14GB  4.27% 49.96%     2.14GB  4.27%  github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete.func1
    1.90GB  3.79% 53.75%     1.90GB  3.79%  github.com/prometheus/prometheus/tsdb.NewCircularExemplarStorage
    1.83GB  3.67% 57.42%     4.41GB  8.82%  github.com/prometheus/prometheus/model/labels.(*ScratchBuilder).Labels
    1.80GB  3.60% 61.02%     1.80GB  3.60%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addDropped

@bboreham
Member Author

Hi, I hope to give a lightning update at PromCon today - please post any statistics from people trying it out.

@prymitive
Contributor

Hi, I hope to give a lightning update at PromCon today - please post any statistics from people trying it out.

I'm on leave for most of September so I didn't pay much attention to these instances, but here's a screenshot of a comparison of our two instances - one vanilla (stringlabels) and the other dedupelabels.
It covers the last 30 days and you can clearly see the moment when one instance was converted to dedupelabels (410dm7). Up to that point both had pretty much the same usage pattern; after that, dedupelabels uses less memory but ~10% more CPU. Memory usage of dedupelabels is a lot less flat - which isn't unexpected given the need to occasionally swap symbol tables etc.

[screenshot: Grafana "Prometheus performance comparison" dashboard, 2024-09-12]

@prymitive
Contributor

Here's the latest screenshot. Although there is a reduction in memory usage, dedupelabels is a lot less predictable about how much memory it will use, compared to the pretty much flat memory usage of stringlabels - which is not unexpected given how dedupelabels works.
Because of that I don't think I'd be very eager to swap stringlabels for dedupelabels on our servers: the memory savings are great and we can spare the extra ~10% CPU to gain them, but I value the stable resource-usage pattern of stringlabels much more. Plus there would always be a worry that some service hits the worst-case usage pattern for dedupelabels, causing occasional OOM kills.

[screenshot: Grafana "Prometheus performance comparison" dashboard, 2024-10-02]

@bboreham
Member Author

If I understand your charts correctly, there is a point around 09/16 where dedupelabels memory goes ~5GB above stringlabels. This has to be a bug somewhere; nothing about the design would lead to that result.

A profile at the time where memory goes higher would be useful.
