
Conversation

@petemounce (Collaborator) commented Aug 24, 2020

Checklist
  • make test-all (UNIX) passes. CI will also test this
  • unit and/or integration tests are included (if applicable)
  • documentation is changed or added (if applicable)

Description of change

Fixes #175, #362, #363.

Adds a prometheus output format. I want to be able to scrape my goss serve endpoints so that I can

  • get metrics for the overall suite-run duration (summary equivalent)
    • ... so I can adjust the timeouts my healthchecks have against cloud instances to be comfortably larger than the goss run duration
    • ... so I can see overall outcomes over time
  • get outcome metrics
    • ... so I can alert on fails based on when they start to appear, triggering an investigation into what caused my state to drift from "good"
      • note - it is an anti-goal to alert on specific assertion IDs, because of the metric cardinality that would result. See the description below and the discussion in the comments
    • ... so I can understand on a more granular level whether there are duration fluctuations or outcome spikes across assertion types

This records

  • overall test-suite run outcome & duration
  • test outcomes
  • test durations
  • ... where both test outcomes and durations can be grouped by outcome and assertion-type.

Sample (from unit-test log):

# TYPE goss_tests_outcomes_duration_seconds counter
goss_tests_outcomes_duration_seconds{outcome="fail",type="command"} 0.01
goss_tests_outcomes_duration_seconds{outcome="pass",type="command"} 0.02
goss_tests_outcomes_duration_seconds{outcome="skip",type="file"} 0.01
# HELP goss_tests_outcomes_total The number of test-outcomes from this run.
# TYPE goss_tests_outcomes_total counter
goss_tests_outcomes_total{outcome="fail",type="command"} 1
goss_tests_outcomes_total{outcome="pass",type="command"} 2
goss_tests_outcomes_total{outcome="skip",type="file"} 1
# HELP goss_tests_run_duration_seconds The end-to-end duration of this run.
# TYPE goss_tests_run_duration_seconds counter
goss_tests_run_duration_seconds 60

Implementation details and notes

I expect to be using this most in the context of goss serve --format prometheus, so that I can configure my prometheus servers to scrape servers that I'm running goss endpoints on. Therefore https://prometheus.io/docs/practices/instrumentation/#online-serving-systems was the most appropriate model to choose.

I followed the advice on metric and label naming (https://prometheus.io/docs/practices/naming/) (er, actually I didn't initially; that's fixed in 1f2cd75).

I followed the advice on labels (https://prometheus.io/docs/practices/instrumentation/#use-labels) and on not overusing them (https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels). I could have added a label for each field of a resource.TestResult, but I didn't think I'd ever need that, and I definitely didn't think it would be worth the cardinality explosion it would have risked.

I chose not to use the prometheus default registry, because it also adds process and golang metrics that I don't think I need. This is the registry = prometheus.NewRegistry() and .With(registry) bits inside the var block. It's easy to swap back to the defaults again, and doing so would be an additive, not breaking, change to the output.

  • I later chose to reverse that, since it was simply easier to then also have an always-on /metrics endpoint; that's what the prometheus ecosystem expects and uses by default (a rough sketch of that wiring follows).
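
For anyone following along, here is a minimal, self-contained Go sketch of that wiring (the metric name and labels mirror the sample above; the package layout, port, and recorded values are illustrative assumptions, not the PR's actual code):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter labelled only by outcome and assertion type, to keep cardinality low.
// promauto registers it with the default registry.
var testOutcomes = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "goss_tests_outcomes_total",
        Help: "The number of test-outcomes from this run.",
    },
    []string{"outcome", "type"},
)

func main() {
    // Record one passing command assertion (illustrative only).
    testOutcomes.WithLabelValues("pass", "command").Inc()

    // Always-on /metrics endpoint serving the default registry, which also
    // exposes the process and golang collectors mentioned above.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9100", nil))
}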

@petemounce marked this pull request as ready for review August 24, 2020 22:50
@petemounce (Collaborator Author)

@aelsabbahy this is ready for review, I think. If it's a welcome PR, then the following people might be interested too based on the previous tickets I've seen. As with #585, I'll let you decide whether/when to do that.

@Smithx10 commented Aug 26, 2020

@petemounce I think it's important for visualization over time that the operator can discover which goss check failed and when.

The format in PR #175 looks like it would provide the operator with all the information they need.

@petemounce (Collaborator Author)

@Smithx10 I agree that would be useful, but I don't think it would outweigh the storage costs and operational risk it would incur on the prometheus server.

I followed the advice on Labels (https://prometheus.io/docs/practices/instrumentation/#use-labels) and not to overuse them (https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels) - I could have added a label for each field of a resource.TestResult, but I didn't think I'd need that ever, and I definitely didn't think it would be worth the cardinality explosion that would have been at risk if I had.

The pertinent point from the 2nd link:

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.
If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system.
To give you a better idea of the underlying numbers, let's look at node_exporter. node_exporter exposes metrics for every mounted filesystem. Every node will have in the tens of timeseries for, say, node_filesystem_avail. If you have 10,000 nodes, you will end up with roughly 100,000 timeseries for node_filesystem_avail, which is fine for Prometheus to handle.
If you were to now add quota per user, you would quickly reach a double digit number of millions with 10,000 users on 10,000 nodes. This is too much for the current implementation of Prometheus. Even with smaller numbers, there's an opportunity cost as you can't have other, potentially more useful metrics on this machine any more.
If you are unsure, start with no labels and add more labels over time as concrete use cases arise.

Remember also that prometheus doesn't currently have downsampling/compaction (prometheus-junkyard/tsdb#56 and prometheus-junkyard/tsdb#313) - so if a single value is exported that is never seen again, the space on disk for that time series is only reclaimed after the global retention period elapses.

Adding the resource_id to the label set introduces literally unbounded cardinality, and therefore risk, to the monitoring storage.

My working assumption has been the standard "monitoring metrics allow alerts on symptoms, which trigger investigation via metrics dashboards and reading centralised logs (which contain more context)".

Can you suggest any way(s) to compromise?

return humanOutcomes
}

func Resources() []string {
@petemounce (Collaborator Author)

This is a hack and I'm about to open a different PR that will let this be replaced by the same pattern that the outputs package uses.

@petemounce (Collaborator Author)

That other PR is #611.

@timeu commented Aug 27, 2020

@petemounce How about making it configurable? By default the endpoint only returns metrics with the type label, but the user could configure which additional labels to return by specifying them as CLI arguments?
So users who might not have that many goss tests could still decide to accept the higher storage costs for better insight into which tests failed.

@harre-orz

@petemounce goss is a test tool; I hope it can output success or failure results for each test case...

@petemounce (Collaborator Author)

@timeu
What would making this configurable look like to you? Introduce format-options for prometheus, where the valid ones would be

  • resource-type
  • resource-id
  • outcome

?

The problem is that, at present, I can't see a way to easily

  • give an outputer default format-options
  • remove default format-options from an outputer
  • (bonus) list outputer-specific format-options in the command-line usage help, tied together

at the existing command-line option (a string slice) without quite a large change in scope. How would the command-line interface need to change to support this?

@harre-orz

  • goss is a test tool, and the test outcomes are visible and scrapeable from its stdout. I think it's very likely that anyone who works somewhere that goes to the trouble of setting up goss to do machine-state assertions, and then the trouble of collecting metrics about those centrally, also works somewhere that ships logs centrally.
  • I think it's a much safer pattern not to risk the integrity of monitoring's storage as a whole by accepting a large blast radius like including resource-id would. If monitoring performance degrades, or worse fails catastrophically (OOMs or runs out of storage space), that's a wake-someone-up-severity incident almost everywhere, whereas machine-state assertions going bad on some subset of hosts is more likely a bug-severity ticket.
    • Cardinality explosion is a well-documented pitfall of using a metrics/monitoring stack for something it's not intended for. Googling "prometheus cardinality explosion" got me 79k hits. Choice reading (the official prometheus doc I've already referenced was 3rd for me):

That said - would either of you (or someone else that might be reading) be up for making a pull-request into my branch that adds the ability to customise the set of labels that would be applied to the metrics being output?

Or, do that as a follow-up PR, treating this one as an MVP? In my opinion (and hope :) ) this PR meets the criteria for being merged: it adds well-tested, non-breaking functionality and moves towards what we all want, which is to monitor the outcomes and durations of our machine-state assertions so that we can

  • alert on assertions starting to fail, triggering an investigation
  • notice when durations start to drag out beyond whatever our scrape-interval is

(Those are my aims with this PR, at any rate)

@petemounce (Collaborator Author) commented Sep 10, 2020

@harre-orz, @Smithx10 & @timeu - friendly ping for feedback? I would suggest this merges as-is (it's quite a large PR, after all) and then we iterate towards something that satisfies us all. What do you think?

@timeu commented Sep 10, 2020

@petemounce : Sorry for the lack of feedback. I think that sounds like a plan.

@petemounce (Collaborator Author)

@harre-orz @Smithx10 - do you have any feedback?

@harre-orz

@petemounce Sorry for the late feedback.
I understand that cardinality explosion is a risk; I think your code is good.

@petemounce (Collaborator Author)

@harre-orz no worries; thanks for getting back to me. :)

@aelsabbahy (Member)

@Smithx10 wondering if you have any feedback on this?

@petemounce (Collaborator Author)

Note - I just added overall outcome counting, and labels to the overall outcome/duration metrics.

@petemounce (Collaborator Author) commented Oct 18, 2020

Note to watchers - I think the change inside #611 would facilitate a follow-on that makes the prometheus output deliver resource-ids, if someone wants that to be possible.

I'd prefer that to be non-default for the reasoning above.

That would look roughly like (see the sketch after this list):

  • making prometheus support the verbose format-option within ValidOptions()
  • making the counters part of the Prometheus struct rather than package variables
  • if that's set via the command-line:
    • adjusting the NewCounterVec calls to add a labelTestId which receives the resource's ID
    • adjusting Prometheus.init to initialise the counters with that extra label
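
A rough Go sketch of that shape (the function name and the resource_id label spelling here are illustrative assumptions, not the identifiers this PR or #611 would actually use):

package outputs

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: choose the label set based on whether the verbose format-option was passed.
func newOutcomeCounter(verbose bool) *prometheus.CounterVec {
    labels := []string{"outcome", "type"}
    if verbose {
        // Opt-in per-resource label; off by default to avoid the unbounded
        // cardinality discussed earlier in this thread.
        labels = append(labels, "resource_id")
    }
    return prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "goss_tests_outcomes_total",
            Help: "The number of test-outcomes from this run.",
        },
        labels,
    )
}

The counter would then live on the Prometheus struct and be initialised in Prometheus.init rather than as a package variable.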

@petemounce (Collaborator Author)

@aelsabbahy I think this is ready to land now, assuming it passes re-review and CI. If it fails CI I'll fix that, obviously, but later today.

@stale stale bot removed the stale label Sep 3, 2021
stale bot commented Nov 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2021
@stale stale bot closed this Nov 10, 2021
@freeseacher

disappointed

@ekelali reopened this Nov 11, 2021
@stale stale bot removed the stale label Nov 11, 2021
stale bot commented Jan 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 12, 2022
@freeseacher

more disappointed

@stale stale bot removed the stale label Jan 12, 2022
@t2d commented Apr 28, 2022

@aelsabbahy I think it would be great to have prometheus output. If I understand correctly, this working PR was stalled by your comment #607 (comment) about the /metrics endpoint.

I don't think I can do the proposed refactoring, but I think I could just rip out the metrics and make this PR smaller. Would that help get it merged?

@petemounce (Collaborator Author)

The intent of this PR is to add metrics support. I haven't understood what you propose to remove from it to enable its merge.

@t2d commented Apr 28, 2022

First, thanks for your work @petemounce !
Second, I don't need /metrics today. But I'm trying to replace nagios with prometheus and goss is one of the last issues. It would be enough if I could just collect the info from /healthz in a manner that prometheus understands.

@aelsabbahy (Member)

It's been so long... my memory is a bit fuzzy on this.

What's the current status of the PR and what remains?

@petemounce Your last two comments seem to indicate there was some remaining work outstanding?

@petemounce (Collaborator Author)

First, thanks for your work @petemounce ! Second, I don't need /metrics today. But I'm trying to replace nagios with prometheus and goss is one of the last issues. It would be enough if I could just collect the info from /healthz in a manner that prometheus understands.

Both endpoints are served via the same implementation as far as I recall.

@aelsabbahy - basically, the remaining work is what you described within #607 (comment)

@aelsabbahy (Member)

If this is good to go, feel free to rebase and I can merge it.

@petemounce (Collaborator Author)

I don't think it is, because of the still-outstanding work to make metrics a concept introduced at the level you directed in #607 (comment)

@aelsabbahy mentioned this pull request Sep 1, 2022
@aelsabbahy (Member)

Hello watchers, there's been a new and simpler attempt at this over in #771, but I do not know if it meets the needs of those asking for this feature.

I'm willing to merge whatever solution the community agrees on.

@petemounce your feedback would be highly appreciated on that PR since you've spent quite a bit of time analyzing this 😄

@aelsabbahy merged commit 1743e98 into goss-org:master Oct 7, 2022
@petemounce (Collaborator Author)

🥳

@DracoBlue

Thanks for merging!

@timeu commented Nov 9, 2022

@petemounce
One thing I noticed after trying the new goss version with this change is that the total values are always labelled with outcome "unknown" if the goss suite has no failures or skipped tests.
Shouldn't this be set to "pass" instead?

# HELP goss_tests_run_duration_milliseconds The end-to-end duration of this run.
# TYPE goss_tests_run_duration_milliseconds counter
goss_tests_run_duration_milliseconds{outcome="unknown"} 126
# HELP goss_tests_run_outcomes_total The outcomes of this run as a whole.
# TYPE goss_tests_run_outcomes_total counter
goss_tests_run_outcomes_total{outcome="unknown"} 5

@petemounce (Collaborator Author)

Good spot, yes. Would you mind opening a new issue?

For reference, I think https://github.com/aelsabbahy/goss/blob/master/outputs/prometheus.go#L85 is the culprit - I've neglected to account for the nested loop.

@timeu commented Nov 9, 2022

@petemounce: Created an issue #789

Also, one more question:
If I don't use the serve endpoint + curl to fetch the prometheus metrics, but instead run validate with --format prometheus, then all metrics can be treated as gauge metrics instead of counters, right?

@petemounce (Collaborator Author)

I think so; the output will be scoped to the content of the run.

@aelsabbahy (Member)

@petemounce as part of #789, can you add some documentation to the manual, or do you prefer a separate ticket?

@petemounce (Collaborator Author)

I'll tackle that there; thanks for flagging.
