
Conversation

@petemounce (Collaborator) commented Aug 24, 2020

Checklist
  • make test-all (UNIX) passes. CI will also test this
  • unit and/or integration tests are included (if applicable)
  • documentation is changed or added (if applicable)

Description of change

Fixes #175, #362, #363.

Adds a prometheus output format. I want to be able to scrape my goss serve endpoints so that I can

  • get metrics for the overall suite-run duration (summary equivalent)
    • ... so I can adjust the timeouts my healthchecks have against cloud instances to be comfortably larger than the goss run duration
    • ... so I can see overall outcomes over time
  • get outcome metrics
    • ... so I can alert on fails based on when they start to appear, triggering an investigation into what caused my state to drift from "good"
      • note - it is an anti-goal to alert on specific assertion IDs, because of the metric cardinality that would result. See the description below and the discussion in the comments
    • ... so I can understand on a more granular level whether there are duration fluctuations or outcome spikes across assertion types

This records

  • overall test-suite run outcome & duration
  • test outcomes
  • test durations
  • ... where both test outcomes and durations can be grouped by outcome and assertion-type.

Sample (from unit-test log):

# TYPE goss_tests_outcomes_duration_seconds counter
goss_tests_outcomes_duration_seconds{outcome="fail",type="command"} 0.01
goss_tests_outcomes_duration_seconds{outcome="pass",type="command"} 0.02
goss_tests_outcomes_duration_seconds{outcome="skip",type="file"} 0.01
# HELP goss_tests_outcomes_total The number of test-outcomes from this run.
# TYPE goss_tests_outcomes_total counter
goss_tests_outcomes_total{outcome="fail",type="command"} 1
goss_tests_outcomes_total{outcome="pass",type="command"} 2
goss_tests_outcomes_total{outcome="skip",type="file"} 1
# HELP goss_tests_run_duration_seconds The end-to-end duration of this run.
# TYPE goss_tests_run_duration_seconds counter
goss_tests_run_duration_seconds 60

Implementation details and notes

I expect to be using this most in the context of goss serve --format prometheus, so that I can configure my prometheus servers to scrape servers that I'm running goss endpoints on. Therefore https://prometheus.io/docs/practices/instrumentation/#online-serving-systems was the most appropriate model to choose.

I followed the advice on metric and label naming (https://prometheus.io/docs/practices/naming/) (er, actually I didn't initially; that's fixed in 1f2cd75).

I followed the advice on labels (https://prometheus.io/docs/practices/instrumentation/#use-labels) and on not overusing them (https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels). I could have added a label for each field of a resource.TestResult, but I didn't think I'd ever need that, and I definitely didn't think it would be worth the cardinality explosion it would have risked.

I chose not to use the prometheus default registry, because it also adds process and golang metrics that I don't think I need. This is the registry = prometheus.NewRegistry() and .With(registry) bits inside the var block. It's easy to swap back to the defaults again, and doing so would be an additive, not breaking, change to the output.

  • I later chose to reverse that, since it was simply easier to then also have an always-on /metrics endpoint; that's what the prometheus ecosystem expects and uses by default (a rough sketch of that wiring follows).
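
For anyone following along, here is a minimal, self-contained Go sketch of that wiring (the metric name and labels mirror the sample above; the package layout, port, and recorded values are illustrative assumptions, not the PR's actual code):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter labelled only by outcome and assertion type, to keep cardinality low.
// promauto registers it with the default registry.
var testOutcomes = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "goss_tests_outcomes_total",
        Help: "The number of test-outcomes from this run.",
    },
    []string{"outcome", "type"},
)

func main() {
    // Record one passing command assertion (illustrative only).
    testOutcomes.WithLabelValues("pass", "command").Inc()

    // Always-on /metrics endpoint serving the default registry, which also
    // exposes the process and golang collectors mentioned above.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9100", nil))
}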

@petemounce marked this pull request as ready for review August 24, 2020 22:50
@petemounce (Collaborator Author)

@aelsabbahy this is ready for review, I think. If it's a welcome PR, then the following people might be interested too based on the previous tickets I've seen. As with #585, I'll let you decide whether/when to do that.

@Smithx10 commented Aug 26, 2020

@petemounce I think it's important for visualization over time that the operator can discover which goss check failed and when.

The format in PR #175 looks like it would provide the operator with all the information they need.

@petemounce (Collaborator Author)

@Smithx10 I agree that would be useful, but I don't think it would outweigh the storage costs and operational risk it would incur on the prometheus server.

I followed the advice on Labels (https://prometheus.io/docs/practices/instrumentation/#use-labels) and not to overuse them (https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels) - I could have added a label for each field of a resource.TestResult, but I didn't think I'd need that ever, and I definitely didn't think it would be worth the cardinality explosion that would have been at risk if I had.

The pertinent point from the 2nd link:

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.
If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system.
To give you a better idea of the underlying numbers, let's look at node_exporter. node_exporter exposes metrics for every mounted filesystem. Every node will have in the tens of timeseries for, say, node_filesystem_avail. If you have 10,000 nodes, you will end up with roughly 100,000 timeseries for node_filesystem_avail, which is fine for Prometheus to handle.
If you were to now add quota per user, you would quickly reach a double digit number of millions with 10,000 users on 10,000 nodes. This is too much for the current implementation of Prometheus. Even with smaller numbers, there's an opportunity cost as you can't have other, potentially more useful metrics on this machine any more.
If you are unsure, start with no labels and add more labels over time as concrete use cases arise.

Remember also that prometheus doesn't currently have downsampling/compaction (prometheus-junkyard/tsdb#56 and prometheus-junkyard/tsdb#313) - so if a single value is exported that is never seen again, the space on disk for that time series is only reclaimed after the global retention period elapses.

Adding the resource_id to the label set introduces literally unbounded cardinality, and therefore risk, to the monitoring storage.

My working assumption has been the standard "monitoring metrics allow alerts on symptoms, which trigger investigation via metrics dashboards and reading centralised logs (which contain more context)".

Can you suggest any way(s) to compromise?

return humanOutcomes
}

func Resources() []string {
@petemounce (Collaborator Author)

This is a hack and I'm about to open a different PR that will let this be replaced by the same pattern that the outputs package uses.

@petemounce (Collaborator Author)

That other PR is #611.

@timeu commented Aug 27, 2020

@petemounce How about making it configurable? By default the endpoint only returns metrics with the type label, but the user could configure which additional labels to return by specifying them as CLI arguments?
So users who might not have that many goss tests could still decide to accept the higher storage costs for better insight into which tests failed.

@harre-orz

@petemounce goss is a test tool; I hope it can output success or failure results for each test case...

@petemounce (Collaborator Author)

@timeu
What would making this configurable look like to you? Introduce format-options for prometheus, where the valid ones would be

  • resource-type
  • resource-id
  • outcome

?

The problem is that, at present, I can't see a way to easily

  • give an outputer default format-options
  • remove default format-options from an outputer
  • (bonus) list outputer-specific format-options in the command-line usage help, tied together

at the existing command-line option (a string slice) without quite a large change in scope. How would the command-line interface need to change to support this?

@harre-orz

  • goss is a test tool, and the test outcomes are visible and scrapeable from its stdout. I think it's very likely that anyone who works somewhere that goes to the trouble of setting up goss to do machine-state assertions, and then the trouble of collecting metrics about those centrally, also works somewhere that ships logs centrally.
  • I think it's a much safer pattern not to risk the integrity of monitoring's storage as a whole by accepting a large blast radius like including resource-id would. If monitoring performance degrades, or worse fails catastrophically (OOMs or runs out of storage space), that's a wake-someone-up-severity incident almost everywhere, whereas machine-state assertions going bad on some subset of hosts is more likely a bug-severity ticket.
    • Cardinality explosion is a well-documented pitfall of using a metrics/monitoring stack for something it's not intended for. Googling "prometheus cardinality explosion" got me 79k hits. Choice reading (the official prometheus doc I've already referenced was 3rd for me):

That said - would either of you (or someone else that might be reading) be up for making a pull-request into my branch that adds the ability to customise the set of labels that would be applied to the metrics being output?

Or, do that as a follow-up PR, treating this one as an MVP? In my opinion (and hope :) ) this PR meets the criteria for being merged: it adds well-tested, non-breaking functionality and moves towards what we all want, which is to monitor the outcomes and durations of our machine-state assertions so that we can

  • alert on assertions starting to fail, triggering an investigation
  • notice when durations start to drag out beyond whatever our scrape-interval is

(Those are my aims with this PR, at any rate)

@petemounce (Collaborator Author) commented Sep 10, 2020

@harre-orz, @Smithx10 & @timeu - friendly ping for feedback? I would suggest this merges as-is (it's quite a large PR, after all) and then we iterate towards something that satisfies us all. What do you think?

@timeu commented Sep 10, 2020

@petemounce : Sorry for the lack of feedback. I think that sounds like a plan.

@petemounce (Collaborator Author)

@harre-orz @Smithx10 - do you have any feedback?

@harre-orz

@petemounce Sorry for the late feedback.
I understand that cardinality explosion is a risk; I think your code is good.

@petemounce (Collaborator Author)

@harre-orz no worries; thanks for getting back to me. :)

@aelsabbahy (Member)

@Smithx10 wondering if you have any feedback on this?

@petemounce (Collaborator Author)

Note - I just added overall outcome counting, and labels to the overall outcome/duration metrics.

@petemounce (Collaborator Author) commented Oct 18, 2020

Note to watchers - I think the change inside #611 would facilitate a follow-on that makes the prometheus output deliver resource-ids, if someone wants that to be possible.

I'd prefer that to be non-default for the reasoning above.

That would look roughly like (see the sketch after this list):

  • making prometheus support the verbose format-option within ValidOptions()
  • making the counters part of the Prometheus struct rather than package variables
  • if that's set via the command-line:
    • adjusting the NewCounterVec calls to add a labelTestId which receives the resource's ID
    • adjusting Prometheus.init to initialise the counters with that extra label
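
A rough Go sketch of that shape (the function name and the resource_id label spelling here are illustrative assumptions, not the identifiers this PR or #611 would actually use):

package outputs

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: choose the label set based on whether the verbose format-option was passed.
func newOutcomeCounter(verbose bool) *prometheus.CounterVec {
    labels := []string{"outcome", "type"}
    if verbose {
        // Opt-in per-resource label; off by default to avoid the unbounded
        // cardinality discussed earlier in this thread.
        labels = append(labels, "resource_id")
    }
    return prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "goss_tests_outcomes_total",
            Help: "The number of test-outcomes from this run.",
        },
        labels,
    )
}

The counter would then live on the Prometheus struct and be initialised in Prometheus.init rather than as a package variable.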

@petemounce (Collaborator Author)

@aelsabbahy I think this is ready to land now, assuming it passes re-review and CI. If it fails CI I'll fix that, obviously, but later today.

@stale stale bot removed the stale label Sep 3, 2021
stale bot commented Nov 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2021
@stale stale bot closed this Nov 10, 2021
@freeseacher

disappointed

@ekelali reopened this Nov 11, 2021
@stale stale bot removed the stale label Nov 11, 2021
stale bot commented Jan 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 12, 2022
@freeseacher

more disappointed

@stale stale bot removed the stale label Jan 12, 2022
@t2d commented Apr 28, 2022

@aelsabbahy I think it would be great to have prometheus output. If I understand correctly, this working PR was stalled by your comment #607 (comment) about the /metrics endpoint.

I don't think I can do the proposed refactoring, but I think I could just rip out the metrics and make this PR smaller. Would that help get it merged?

@petemounce (Collaborator Author)

The intent of this PR is to add metrics support. I haven't understood what you propose to remove from it to enable its merge.

@t2d commented Apr 28, 2022

First, thanks for your work @petemounce !
Second, I don't need /metrics today. But I'm trying to replace nagios with prometheus and goss is one of the last issues. It would be enough if I could just collect the info from /healthz in a manner that prometheus understands.

@aelsabbahy (Member)

It's been so long... my memory is a bit fuzzy on this.

What's the current status of the PR and what remains?

@petemounce Your last two comments seem to indicate there was some remaining work outstanding?

@petemounce (Collaborator Author)

First, thanks for your work @petemounce ! Second, I don't need /metrics today. But I'm trying to replace nagios with prometheus and goss is one of the last issues. It would be enough if I could just collect the info from /healthz in a manner that prometheus understands.

Both endpoints are served via the same implementation as far as I recall.

@aelsabbahy - basically, the remaining work is what you described within #607 (comment)

@aelsabbahy (Member)

If this is good to go, feel free to rebase and I can merge it.

@petemounce (Collaborator Author)

I don't think it is, because of the still-outstanding work to make metrics a concept introduced at the level you directed in #607 (comment)

@aelsabbahy mentioned this pull request Sep 1, 2022
@aelsabbahy (Member)

Hello watchers, there's been a new and simpler attempt at this over in #771, but I do not know if it meets the needs of those asking for this feature.

I'm willing to merge whatever solution the community agrees on.

@petemounce your feedback would be highly appreciated on that PR since you've spent quite a bit of time analyzing this 😄

@aelsabbahy merged commit 1743e98 into goss-org:master Oct 7, 2022
@petemounce (Collaborator Author)

🥳

@DracoBlue

Thanks for merging!

@timeu commented Nov 9, 2022

@petemounce
One thing I noticed after trying the new goss version with this change is that the total values are always labelled with outcome "unknown" if the goss suite has no failures or skipped tests.
Shouldn't this be set to "pass" instead?

# HELP goss_tests_run_duration_milliseconds The end-to-end duration of this run.
# TYPE goss_tests_run_duration_milliseconds counter
goss_tests_run_duration_milliseconds{outcome="unknown"} 126
# HELP goss_tests_run_outcomes_total The outcomes of this run as a whole.
# TYPE goss_tests_run_outcomes_total counter
goss_tests_run_outcomes_total{outcome="unknown"} 5

@petemounce (Collaborator Author)

Good spot, yes. Would you mind opening a new issue?

For reference, I think https://github.com/aelsabbahy/goss/blob/master/outputs/prometheus.go#L85 is the culprit - I've neglected to account for the nested loop.

@timeu commented Nov 9, 2022

@petemounce: Created an issue #789

Also, one more question:
If I don't use the serve endpoint + curl to fetch the prometheus metrics, but instead run validate with --format prometheus, then all metrics can be treated as gauge metrics instead of counters, right?

@petemounce (Collaborator Author)

I think so; the output will be scoped to the content of the run.

@aelsabbahy (Member)

@petemounce as part of #789, can you add some documentation to the manual, or do you prefer a separate ticket?

@petemounce (Collaborator Author)

I'll tackle that there; thanks for flagging.
