-
Notifications
You must be signed in to change notification settings - Fork 5.7k
feat(outputs.health): Add max time between metrics check #16646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4df75f2
to
f83e744
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tulku for your contribution! I think the idea is nice. Some suggestions for simplification in the code. The tests need to be adapted accordingly...
plugins/outputs/health/health.go
Outdated
if h.UnhealthyIfMetricsAreDelayed && h.firstMetricReceived { | ||
healthy = healthy && time.Since(h.lastMetricTime) < time.Duration(h.MaxTimeBetweenMetrics) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if h.UnhealthyIfMetricsAreDelayed && h.firstMetricReceived { | |
healthy = healthy && time.Since(h.lastMetricTime) < time.Duration(h.MaxTimeBetweenMetrics) | |
} | |
if h.MaxTimeBetweenMetrics > 0 { | |
healthy = healthy && time.Since(h.lastMetricTime) < time.Duration(h.MaxTimeBetweenMetrics) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I added a check for the lastMetricTime
being non zero also, so the status is healthy before the first metrics are written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm shouldn't we rather set the h.lastMetricTime
to now
in Connect
as otherwise the output stays healthy if no metric arrives at all...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want the output to be healthy if no metric is received. This is because I want to avoid the case where telegraf is restarted (because of being unhealthy) on startup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the last change, lastMetricTime
is set to now
in Connect
.
plugins/outputs/health/health.go
Outdated
h.firstMetricReceived = true | ||
h.lastMetricTime = time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about initializing the lastMetricTime
in Connect
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because Connect
is guaranteed to be called before the first Write
, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left the initialization in NewHealth
(which is called from init
). If Connect
is preferred, I can move it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Connect
would be better. The init
function is called when importing this plugin so you might get timeouts if setting up Telegraf takes long. Connect
is called after Init
but before the first Write
and is the better place I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the latest change, lastMetricTime
is initialized as 0 and it is only updated in Write
. Whe can add an initialization to now
on Connect
. This will help catch cases where telegraf starts but no metric is written before the max time elapses.
The current behaviour waits for the first metric to be written to start checking if the time between metrics is longer than expected. This was by design, to prevent the health output to fail during startup. However, if Connect
is called 'close enough' to when everything should already be working, we can set lastMetricTime = now
in connect also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the last change, lastMetricTime
is set to now
in Connect
.
f83e744
to
d5dc1a1
Compare
Thanks for the review @srebhan ! The code is now simpler, removing the two boolean flags. I took the liberty to initialize |
d5dc1a1
to
c42dcf0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tulku! One more comment... Please also check my responses to the unresolved previous comments...
plugins/outputs/health/health.go
Outdated
MaxTimeBetweenMetrics: 0, | ||
healthy: true, | ||
lastMetricTime: time.Time{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to initialize the two fields as Golang always uses the nil
value when creating new instances. So go with
MaxTimeBetweenMetrics: 0, | |
healthy: true, | |
lastMetricTime: time.Time{}, | |
healthy: true, |
@tulku please also run |
c42dcf0
to
cf1f78d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tulku! Please set the initial h.lastMetricTime
in Connect()
to avoid cases where no metric arrives at all and the plugin is in healthy state.
cf1f78d
to
38cfe92
Compare
38cfe92
to
8d25ea4
Compare
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @tulku!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tulku Thanks for your contribution!
Summary
This adds a new check to the
health
output plugin that can detect moments when no metrics are flowing throughtelegraf
. For example, when impacted by issues like #16070This is disabled by default, and can be enabled using the new configuration options:
Besides the included unittests, this change was validated running
telegraf
with the following configuration:And querying the health status using:
This PR replaces the previous implementation attempt done in #16212
For more context, refer to this slack thread
Checklist
Related issues
resolves #16645