scrape (histograms): Investigate (and address) protobuf scraping performance problems #14668

@beorn7

Proposal

tl;dr: To scrape native histograms, the scrape has to happen via protobuf. The current protobuf parsing creates a lot of memory churn, which in practice mostly manifests as increased CPU usage for garbage collection. Depending on the exact usage pattern of a Prometheus server, CPU usage may grow significantly (we have seen a 2x increase in real-world use cases). This can hit users hard if they "simply" switch on native histograms, and it is especially surprising for users who only want to ingest a few native histograms, because activating the feature flag switches all scraping to protobuf.

An important implication for declaring native histograms a stable feature: We cannot simply switch on native histograms by default if that implies that all scrapes use protobuf while protobuf scraping needs as much more CPU as we are seeing now. (In other words: this issue needs to be resolved before native histograms can be declared a stable feature.)

Various pieces of context

  • Prometheus currently uses the (unmaintained) gogo-protobuf library. It already performs much better than the official Go protobuf library. Independent of this issue, we have plans (and a need) to migrate away from gogo-protobuf to an actively maintained library. It is preferable to get this done soon and to start the performance optimizations with the new library, to avoid doing the optimization work twice (before and after the migration).
  • To get to the ground truth, we need to do proper profiling first (see the profiling sketch after this list). Having said that, I assume the source of the problem is that most expositions are still strongly dominated by strings (metric names, label names, label values). The text parser is highly optimized to ingest those strings efficiently, while the generated protobuf parsing code creates a separate Go string for each of them before our own label handling code can even kick in.
  • The current parsing and scrape code is highly intertwined, with few abstraction barriers, and it exploits specifics of the text format to achieve high ingestion performance. For example, if the part of a line that defines the series (name, labels) has not changed compared to the last scrape, no parsing happens at all and the code simply re-uses the already known label set (see the cache sketch after this list). The protobuf parsing has to work against this approach: the protobuf exposition (including the allocation of all those strings) is parsed anew on every scrape, and on top of that, the parsing code has to emulate a "line-based view" of the exposition so that it can be ingested properly.
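For the profiling mentioned in the second bullet, Prometheus' built-in /debug/pprof endpoints can be used. A minimal sketch to capture a CPU profile (the address is an assumption, adjust to your setup; an allocation profile is available from /debug/pprof/allocs in the same way):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Fetch a 30-second CPU profile from a local Prometheus instance.
// The /debug/pprof endpoints are part of Prometheus' standard HTTP server;
// localhost:9090 is an assumption, adjust to your setup.
func main() {
	resp, err := http.Get("http://localhost:9090/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	// Inspect with: go tool pprof cpu.pprof
}
```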
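And to illustrate the label-set re-use described in the third bullet: conceptually, the raw bytes of the series part of a line serve as a cache key, so an unchanged series requires no re-parsing at all. The names and types below are illustrative, not Prometheus' actual code:

```go
package main

import "fmt"

type labelSet map[string]string

// seriesCache sketches the idea used by the text-format scrape loop:
// the raw bytes of the series part of a line (name + labels) are used
// as a cache key, so an unchanged series needs no re-parsing at all.
type seriesCache struct {
	entries map[string]labelSet
}

func (c *seriesCache) get(rawSeries []byte) (labelSet, bool) {
	ls, ok := c.entries[string(rawSeries)] // this map-lookup pattern does not allocate for the key
	return ls, ok
}

func (c *seriesCache) add(rawSeries []byte, ls labelSet) {
	c.entries[string(rawSeries)] = ls
}

// parseLabels stands in for the expensive parsing step.
func parseLabels(rawSeries []byte) labelSet {
	return labelSet{"__name__": string(rawSeries)}
}

func main() {
	c := &seriesCache{entries: map[string]labelSet{}}
	line := []byte(`http_requests_total{code="200"}`)

	ls, ok := c.get(line)
	if !ok {
		ls = parseLabels(line) // only happens on the first scrape of this series
		c.add(line, ls)
	}
	fmt.Println(ls, ok)
}
```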

Immediate work-around

In a scenario where only a select few endpoints expose native histograms, users can switch back to text-based scraping on the endpoints where they don't expect native histograms, using the scrape_protocols setting in the scrape config.
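For illustration, a scrape config along the following lines limits protobuf negotiation to the one job that actually needs native histograms (a minimal sketch; the job names and targets are placeholders):

```yaml
scrape_configs:
  # Only this job negotiates protobuf first, enabling native histogram ingestion.
  - job_name: "with-native-histograms"
    scrape_protocols: ["PrometheusProto", "OpenMetricsText1.0.0", "OpenMetricsText0.0.1", "PrometheusText0.0.4"]
    static_configs:
      - targets: ["app:8080"]
  # All other jobs stick to text-based scraping and avoid the protobuf overhead.
  - job_name: "everything-else"
    scrape_protocols: ["OpenMetricsText1.0.0", "OpenMetricsText0.0.1", "PrometheusText0.0.4"]
    static_configs:
      - targets: ["node-exporter:9100"]
```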

Work-around in the not too far future

Once native histograms can be scraped via the text format, we can make that the default (or users can switch to it). Even though heavy native histograms might be less efficient to scrape in the text format, most expositions will still be dominated by conventional metrics, so that overall, the performance impact should be much more manageable than that of a complete switch to protobuf for everything.

Actual solutions

In the midterm, we should refactor the scrape code anyway so that it is no longer "hardwired to the text format". That alone might already yield a performance gain for protobuf (while we still aim to lose as little performance as possible when scraping the text format).

In general, we need to address the bottlenecks as identified by profiling. In the ideal case, we can optimize things just by changing the way we use our protobuf library of choice. Less ideal (and more likely) is the case where we need a more invasive approach to avoid allocating the many strings anew on each scrape, as sketched below. The most extreme approach would be to hand-code the entire protobuf decoding.
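To illustrate the kind of approach meant by "avoid allocating the many strings anew": an interning table keyed on the raw bytes would allocate each distinct label string only once across scrapes. This is an illustrative sketch, not Prometheus code; a production version would also need eviction or reference counting:

```go
package main

import "fmt"

// interner returns a canonical string for a given byte sequence, allocating a
// new string only the first time that sequence is seen. Decoding protobuf
// label names/values through such a table would avoid re-allocating the same
// strings on every scrape. (Illustrative sketch, not Prometheus' actual code.)
type interner struct {
	table map[string]string
}

func (i *interner) intern(b []byte) string {
	// The string(b) conversion in a map index expression does not allocate.
	if s, ok := i.table[string(b)]; ok {
		return s
	}
	s := string(b) // allocate exactly once per distinct byte sequence
	i.table[s] = s
	return s
}

func main() {
	in := &interner{table: map[string]string{}}
	a := in.intern([]byte("http_requests_total"))
	b := in.intern([]byte("http_requests_total"))
	fmt.Println(a == b) // true; the second call reused the first allocation
}
```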
