
Proposal: add query sharding support to TSDB #10420

@pracucci

Description


Proposal

In November 2021, Grafana Labs announced query sharding for Grafana Enterprise Metrics (GEM).

The query sharding implementation leverages changes we made to TSDB to selectively query a given shard, so that multiple processes (running on different machines) can each query a different shard of the same copy of a TSDB block.

A shard is a consistent set of series: the same series belongs to the same shard across different blocks. To achieve this, we use a simple hash-mod algorithm to determine which shard each series belongs to.
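As an illustration, the hash-mod assignment can be sketched as follows. This is a simplified stand-in: it hashes the label pairs with FNV, whereas Prometheus' actual `labels.Hash()` uses xxhash over the label pairs, and the helper name `shardOf` is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// shardOf assigns a series to a shard by hashing its labels mod shardCount.
// Simplified illustration only: real TSDB code would use labels.Hash()
// rather than FNV over a joined string.
func shardOf(lbls map[string]string, shardCount uint64) uint64 {
	// Hash label pairs in a deterministic (sorted) order.
	names := make([]string, 0, len(lbls))
	for n := range lbls {
		names = append(names, n)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, n := range names {
		h.Write([]byte(n))
		h.Write([]byte{0xff}) // separator to avoid ambiguous concatenations
		h.Write([]byte(lbls[n]))
		h.Write([]byte{0xff})
	}
	return h.Sum64() % shardCount
}

func main() {
	series := map[string]string{"__name__": "metric", "instance": "a"}
	// The same series always maps to the same shard, across all blocks.
	fmt.Println(shardOf(series, 3) == shardOf(series, 3)) // true
}
```

Because the hash depends only on the series labels, the assignment is stable across blocks, which is the property the consistency guarantee above relies on.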

We propose to upstream to Prometheus TSDB the changes we did to support query sharding.

The proposed changes target TSDB as a library. They can be used by other OSS projects as a foundation to build query sharding capabilities.

Use case. Why is this important?

Query sharding can significantly reduce the time required to execute high cardinality and/or CPU intensive queries. Running GEM at Grafana Labs, we’ve seen a 10x reduction in query latency for high cardinality queries (up to 30x for some special cases).

How query sharding works in GEM

Let’s start with an example. Consider this simple query:

sum(rate(metric[1m]))

Query sharding splits the query into N partial queries, where N is the number of shards (3 in this example). Each partial query runs the inner sum(rate()) on a different set of series (a shard) and is executed by a different process, typically running on a different machine. The results of the partial queries are then concatenated, and a final sum() aggregation runs on top:

sum(
  concat(
    sum(rate(metric{__query_shard__="1_of_3"}[1m]))
    sum(rate(metric{__query_shard__="2_of_3"}[1m]))
    sum(rate(metric{__query_shard__="3_of_3"}[1m]))
  )
)

Note: concat() is not a real PromQL function. It's used here only to illustrate the series concatenation step.
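The rewrite above can be sketched in code. This is a hypothetical helper (a real query frontend would rewrite the PromQL AST, not format strings), shown only to make the `__query_shard__="i_of_n"` matcher convention concrete:

```go
package main

import "fmt"

// shardLabelMatcher builds the synthetic matcher injected into each
// partial query, using the 1-based "1_of_3" value format from the example.
func shardLabelMatcher(shardIndex, shardCount int) string {
	return fmt.Sprintf(`__query_shard__="%d_of_%d"`, shardIndex+1, shardCount)
}

// shardQuery produces the N partial queries for the example query
// sum(rate(metric[1m])). String templating is for illustration only.
func shardQuery(metric, rng string, shardCount int) []string {
	partials := make([]string, 0, shardCount)
	for i := 0; i < shardCount; i++ {
		partials = append(partials,
			fmt.Sprintf("sum(rate(%s{%s}[%s]))", metric, shardLabelMatcher(i, shardCount), rng))
	}
	return partials
}

func main() {
	for _, q := range shardQuery("metric", "1m", 3) {
		fmt.Println(q)
	}
}
```

Running this prints the three partial queries shown above, one per shard.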

How GEM queries different shards from the same TSDB block

For a given block (containing all series for a time period, e.g. 2h), a partial query needs to fetch chunks only from the subset of series matching the given shard. To support this, we added a couple of options to storage.SelectHints:

  • ShardIndex: the ID of the shard to query from the block (ranges between 0 and ShardCount-1). Only series whose hash mod matches the ShardIndex will be considered when running the Select().
  • ShardCount: Total number of shards.

Proposed changes in Prometheus TSDB

To add sharding capability to TSDB querier’s Select(), we propose the following changes.

Add ShardIndex and ShardCount to storage.SelectHints:

type SelectHints struct {
	// …

	ShardIndex uint64 // Current shard index (starts from 0 and up to ShardCount-1).
	ShardCount uint64 // Total number of shards (0 means sharding is disabled).
}
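For illustration, here is how a caller could populate the proposed hints for one partial query. The struct below is a local copy of the relevant fields (so the sketch is self-contained), not the actual storage.SelectHints:

```go
package main

import "fmt"

// SelectHints mirrors the proposed sharding fields of storage.SelectHints
// (local copy for illustration).
type SelectHints struct {
	Start, End int64  // query time range (milliseconds since epoch)
	ShardIndex uint64 // current shard index, 0 to ShardCount-1
	ShardCount uint64 // total number of shards; 0 disables sharding
}

func main() {
	// Hints for the partial query handling shard "2_of_3". Note the hints
	// use a 0-based index, while the example matcher value "2_of_3" is 1-based.
	hints := &SelectHints{
		Start:      0,
		End:        7_200_000, // a 2h block
		ShardIndex: 1,
		ShardCount: 3,
	}
	fmt.Printf("querying shard %d of %d\n", hints.ShardIndex+1, hints.ShardCount)
}
```

Leaving ShardCount at its zero value keeps Select() behaving exactly as today, so the change is backwards compatible for existing callers.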

Add ShardedPostings() to tsdb.IndexReader. ShardedPostings() works similarly to SortedPostings() but, instead of sorting postings, it filters postings by the given shard index and count:

type IndexReader interface {
	// …

	// ShardedPostings returns a postings list filtered by the provided shardIndex
	// out of shardCount. For a given posting, its shard MUST be computed hashing
	// the series labels mod shardCount (eg. `labels.Hash() % shardCount == shardIndex`).
	ShardedPostings(p index.Postings, shardIndex, shardCount uint64) index.Postings
}
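A minimal sketch of the filtering contract, using a toy stand-in for index.Postings so the example is self-contained (the iterator shape and the hash lookup function are simplified assumptions, not the reference implementation in #10421):

```go
package main

import "fmt"

// Postings is a minimal stand-in for index.Postings: an iterator over
// series references.
type Postings interface {
	Next() bool
	At() uint64
}

type listPostings struct {
	refs []uint64
	cur  int
}

func (p *listPostings) Next() bool { p.cur++; return p.cur <= len(p.refs) }
func (p *listPostings) At() uint64 { return p.refs[p.cur-1] }

// shardedPostings wraps another Postings and yields only series whose
// labels hash, mod shardCount, equals shardIndex -- the contract the
// proposed ShardedPostings() method must honor.
type shardedPostings struct {
	p          Postings
	hash       func(seriesRef uint64) uint64 // looks up the series labels hash
	shardIndex uint64
	shardCount uint64
}

func (s *shardedPostings) Next() bool {
	for s.p.Next() {
		if s.hash(s.p.At())%s.shardCount == s.shardIndex {
			return true
		}
	}
	return false
}

func (s *shardedPostings) At() uint64 { return s.p.At() }

func main() {
	// Toy hash: pretend each series' labels hash equals its ref.
	all := &listPostings{refs: []uint64{0, 1, 2, 3, 4, 5}}
	sharded := &shardedPostings{
		p:          all,
		hash:       func(r uint64) uint64 { return r },
		shardIndex: 1,
		shardCount: 3,
	}
	for sharded.Next() {
		fmt.Println(sharded.At()) // prints 1, then 4
	}
}
```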

Finally, call ShardedPostings() from blockQuerier.Select() and blockChunkQuerier.Select().

To give you a better understanding of this proposal, I’ve opened a PR showing a reference implementation: #10421

Follow-up work

If this proposal is accepted, there is some follow-up work we commit to upstreaming as well, in particular:

Series hash caching

When querying the TSDB head, the series hash is already in memory, but when querying a TSDB block the series hash needs to be computed each time. Computing the series hash is an expensive operation.

To overcome this, we’ve built an optional, configurable and fast in-memory cache for the series hashes, which we’ll propose as a follow up PR.
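To sketch the idea (names and structure here are hypothetical, not the actual follow-up implementation): the cache memoizes the labels hash per block and series reference, so ShardedPostings() doesn't recompute it on every query.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesHashCache is an illustrative in-memory cache of series label
// hashes, keyed by block ID and series reference. Hypothetical sketch;
// a production cache would also bound its memory usage.
type seriesHashCache struct {
	mtx    sync.RWMutex
	hashes map[string]map[uint64]uint64 // block ULID -> series ref -> hash
}

func newSeriesHashCache() *seriesHashCache {
	return &seriesHashCache{hashes: map[string]map[uint64]uint64{}}
}

func (c *seriesHashCache) get(blockID string, ref uint64) (uint64, bool) {
	c.mtx.RLock()
	defer c.mtx.RUnlock()
	h, ok := c.hashes[blockID][ref]
	return h, ok
}

func (c *seriesHashCache) set(blockID string, ref, hash uint64) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	block, ok := c.hashes[blockID]
	if !ok {
		block = map[uint64]uint64{}
		c.hashes[blockID] = block
	}
	block[ref] = hash
}

func main() {
	c := newSeriesHashCache()
	c.set("01ABC", 42, 12345)
	if h, ok := c.get("01ABC", 42); ok {
		fmt.Println(h) // 12345
	}
}
```

Keying by block is natural because a block is immutable: once a series hash is computed for a block, it stays valid for that block's lifetime.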
