
Proposal: add query sharding support to TSDB #10420

@pracucci

Description


Proposal

In November 2021, Grafana Labs announced query sharding for Grafana Enterprise Metrics (GEM).

The query sharding implementation leverages changes we made to TSDB to selectively query a given shard, so that multiple processes (running on different machines) can each query a different shard of the same copy of a TSDB block.

A shard is a consistent set of series: the same series belongs to the same shard across different blocks. To achieve this, we use a simple hash-mod algorithm to determine which shard each series belongs to.
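As an illustration, the hash-mod assignment can be sketched as follows. This is a simplified stand-in: it hashes the label pairs with FNV, whereas Prometheus' actual `labels.Hash()` uses xxhash over the label pairs, and the helper name `shardOf` is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// shardOf assigns a series to a shard by hashing its labels mod shardCount.
// Simplified illustration only: real TSDB code would use labels.Hash()
// rather than FNV over a joined string.
func shardOf(lbls map[string]string, shardCount uint64) uint64 {
	// Hash label pairs in a deterministic (sorted) order.
	names := make([]string, 0, len(lbls))
	for n := range lbls {
		names = append(names, n)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, n := range names {
		h.Write([]byte(n))
		h.Write([]byte{0xff}) // separator to avoid ambiguous concatenations
		h.Write([]byte(lbls[n]))
		h.Write([]byte{0xff})
	}
	return h.Sum64() % shardCount
}

func main() {
	series := map[string]string{"__name__": "metric", "instance": "a"}
	// The same series always maps to the same shard, across all blocks.
	fmt.Println(shardOf(series, 3) == shardOf(series, 3)) // true
}
```

Because the hash depends only on the series labels, the assignment is stable across blocks, which is the property the consistency guarantee above relies on.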

We propose to upstream to Prometheus TSDB the changes we did to support query sharding.

The proposed changes target TSDB as a library. They can be used by other OSS projects as a foundation to build query sharding capabilities.

Use case. Why is this important?

Query sharding can significantly reduce the time required to execute high cardinality and/or CPU intensive queries. Running GEM at Grafana Labs, we’ve seen a 10x reduction in query latency for high cardinality queries (up to 30x for some special cases).

How query sharding works in GEM

Let’s start with an example. Consider this simple query:

sum(rate(metric[1m]))

Query sharding splits the query into N partial queries, where N is the number of shards (3 in this example). Each partial query runs the inner sum(rate()) on a different set of series (a shard) and is executed by a different process, typically running on a different machine. The results of the partial queries are then concatenated, and a final sum() aggregation runs on top:

sum(
  concat(
    sum(rate(metric{__query_shard__="1_of_3"}[1m]))
    sum(rate(metric{__query_shard__="2_of_3"}[1m]))
    sum(rate(metric{__query_shard__="3_of_3"}[1m]))
  )
)

Note: concat() is not a real PromQL function. It's used here only to illustrate the series concatenation step.
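The rewrite above can be sketched in code. This is a hypothetical helper (a real query frontend would rewrite the PromQL AST, not format strings), shown only to make the `__query_shard__="i_of_n"` matcher convention concrete:

```go
package main

import "fmt"

// shardLabelMatcher builds the synthetic matcher injected into each
// partial query, using the 1-based "1_of_3" value format from the example.
func shardLabelMatcher(shardIndex, shardCount int) string {
	return fmt.Sprintf(`__query_shard__="%d_of_%d"`, shardIndex+1, shardCount)
}

// shardQuery produces the N partial queries for the example query
// sum(rate(metric[1m])). String templating is for illustration only.
func shardQuery(metric, rng string, shardCount int) []string {
	partials := make([]string, 0, shardCount)
	for i := 0; i < shardCount; i++ {
		partials = append(partials,
			fmt.Sprintf("sum(rate(%s{%s}[%s]))", metric, shardLabelMatcher(i, shardCount), rng))
	}
	return partials
}

func main() {
	for _, q := range shardQuery("metric", "1m", 3) {
		fmt.Println(q)
	}
}
```

Running this prints the three partial queries shown above, one per shard.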

How GEM queries different shards from the same TSDB block

For a given block (containing all series for a time period, e.g. 2h), a partial query needs to fetch chunks only from the subset of series matching the given shard. To support this, we added a couple of options to storage.SelectHints:

  • ShardIndex: the ID of the shard to query from the block (ranges between 0 and ShardCount-1). Only series whose hash mod matches the ShardIndex will be considered when running the Select().
  • ShardCount: Total number of shards.

Proposed changes in Prometheus TSDB

To add sharding capability to TSDB querier’s Select(), we propose the following changes.

Add ShardIndex and ShardCount to storage.SelectHints:

type SelectHints struct {
	// …

	ShardIndex uint64 // Current shard index (starts from 0 and up to ShardCount-1).
	ShardCount uint64 // Total number of shards (0 means sharding is disabled).
}
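For illustration, here is how a caller could populate the proposed hints for one partial query. The struct below is a local copy of the relevant fields (so the sketch is self-contained), not the actual storage.SelectHints:

```go
package main

import "fmt"

// SelectHints mirrors the proposed sharding fields of storage.SelectHints
// (local copy for illustration).
type SelectHints struct {
	Start, End int64  // query time range (milliseconds since epoch)
	ShardIndex uint64 // current shard index, 0 to ShardCount-1
	ShardCount uint64 // total number of shards; 0 disables sharding
}

func main() {
	// Hints for the partial query handling shard "2_of_3". Note the hints
	// use a 0-based index, while the example matcher value "2_of_3" is 1-based.
	hints := &SelectHints{
		Start:      0,
		End:        7_200_000, // a 2h block
		ShardIndex: 1,
		ShardCount: 3,
	}
	fmt.Printf("querying shard %d of %d\n", hints.ShardIndex+1, hints.ShardCount)
}
```

Leaving ShardCount at its zero value keeps Select() behaving exactly as today, so the change is backwards compatible for existing callers.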

Add ShardedPostings() to tsdb.IndexReader. ShardedPostings() works similarly to SortedPostings() but, instead of sorting postings, it filters postings by the given shard index and count:

type IndexReader interface {
	// …

	// ShardedPostings returns a postings list filtered by the provided shardIndex
	// out of shardCount. For a given posting, its shard MUST be computed hashing
	// the series labels mod shardCount (eg. `labels.Hash() % shardCount == shardIndex`).
	ShardedPostings(p index.Postings, shardIndex, shardCount uint64) index.Postings
}
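A minimal sketch of the filtering contract, using a toy stand-in for index.Postings so the example is self-contained (the iterator shape and the hash lookup function are simplified assumptions, not the reference implementation in #10421):

```go
package main

import "fmt"

// Postings is a minimal stand-in for index.Postings: an iterator over
// series references.
type Postings interface {
	Next() bool
	At() uint64
}

type listPostings struct {
	refs []uint64
	cur  int
}

func (p *listPostings) Next() bool { p.cur++; return p.cur <= len(p.refs) }
func (p *listPostings) At() uint64 { return p.refs[p.cur-1] }

// shardedPostings wraps another Postings and yields only series whose
// labels hash, mod shardCount, equals shardIndex -- the contract the
// proposed ShardedPostings() method must honor.
type shardedPostings struct {
	p          Postings
	hash       func(seriesRef uint64) uint64 // looks up the series labels hash
	shardIndex uint64
	shardCount uint64
}

func (s *shardedPostings) Next() bool {
	for s.p.Next() {
		if s.hash(s.p.At())%s.shardCount == s.shardIndex {
			return true
		}
	}
	return false
}

func (s *shardedPostings) At() uint64 { return s.p.At() }

func main() {
	// Toy hash: pretend each series' labels hash equals its ref.
	all := &listPostings{refs: []uint64{0, 1, 2, 3, 4, 5}}
	sharded := &shardedPostings{
		p:          all,
		hash:       func(r uint64) uint64 { return r },
		shardIndex: 1,
		shardCount: 3,
	}
	for sharded.Next() {
		fmt.Println(sharded.At()) // prints 1, then 4
	}
}
```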

Finally, call ShardedPostings() from blockQuerier.Select() and blockChunkQuerier.Select().

To give you a better understanding of this proposal, I’ve opened a PR showing a reference implementation: #10421

Follow-up work

If this proposal is accepted, there is some follow-up work we commit to upstreaming as well, in particular:

Series hash caching

When querying the TSDB head, the series hash is already in memory, but when querying a TSDB block the series hash needs to be computed each time. Computing the series hash is an expensive operation.

To overcome this, we’ve built an optional, configurable and fast in-memory cache for the series hashes, which we’ll propose as a follow up PR.
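To sketch the idea (names and structure here are hypothetical, not the actual follow-up implementation): the cache memoizes the labels hash per block and series reference, so ShardedPostings() doesn't recompute it on every query.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesHashCache is an illustrative in-memory cache of series label
// hashes, keyed by block ID and series reference. Hypothetical sketch;
// a production cache would also bound its memory usage.
type seriesHashCache struct {
	mtx    sync.RWMutex
	hashes map[string]map[uint64]uint64 // block ULID -> series ref -> hash
}

func newSeriesHashCache() *seriesHashCache {
	return &seriesHashCache{hashes: map[string]map[uint64]uint64{}}
}

func (c *seriesHashCache) get(blockID string, ref uint64) (uint64, bool) {
	c.mtx.RLock()
	defer c.mtx.RUnlock()
	h, ok := c.hashes[blockID][ref]
	return h, ok
}

func (c *seriesHashCache) set(blockID string, ref, hash uint64) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	block, ok := c.hashes[blockID]
	if !ok {
		block = map[uint64]uint64{}
		c.hashes[blockID] = block
	}
	block[ref] = hash
}

func main() {
	c := newSeriesHashCache()
	c.set("01ABC", 42, 12345)
	if h, ok := c.get("01ABC", 42); ok {
		fmt.Println(h) // 12345
	}
}
```

Keying by block is natural because a block is immutable: once a series hash is computed for a block, it stays valid for that block's lifetime.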
