Skip to content

feat(outputs.health): Add max time between metrics check #16646

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 14, 2025

Conversation

tulku
Copy link
Contributor

@tulku tulku commented Mar 17, 2025

Summary

This adds a new check to the health output plugin that can detect moments when no metrics are flowing through telegraf. For example, when impacted by issues like #16070

This is disabled by default, and can be enabled using the new configuration options:

[[outputs.health]]
  service_address = "http://:8080"
  max_time_between_metrics = "1s"
  unhealthy_if_metrics_are_delayed = true

Besides the included unittests, this change was validated running telegraf with the following configuration:

[[inputs.internal]]
  interval = "20s"

[[outputs.health]]
  service_address = "http://:8080"
  max_time_between_metrics = "1s"
  unhealthy_if_metrics_are_delayed = true

And querying the health status using:

while true; do curl -X GET localhost:8080; sleep 0.01; done
  1. Telegraf starts healthy.
  2. Once 1s passes after the first internal metrics are written, the status changes to unhealthy
  3. After that, every 20s, it reports OK for 1s and then goes back to unhealthy.

This PR replaces the previous implementation attempt done in #16212
For more context, refer to this slack thread

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #16645

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Mar 17, 2025
@tulku tulku force-pushed the feat/health_timeout_check branch from 4df75f2 to f83e744 Compare March 17, 2025 09:55
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tulku for your contribution! I think the idea is nice. Some suggestions for simplification in the code. The tests need to be adapted accordingly...

Comment on lines 153 to 155
if h.UnhealthyIfMetricsAreDelayed && h.firstMetricReceived {
healthy = healthy && time.Since(h.lastMetricTime) < time.Duration(h.MaxTimeBetweenMetrics)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if h.UnhealthyIfMetricsAreDelayed && h.firstMetricReceived {
healthy = healthy && time.Since(h.lastMetricTime) < time.Duration(h.MaxTimeBetweenMetrics)
}
if h.MaxTimeBetweenMetrics > 0 {
healthy = healthy && time.Since(h.lastMetricTime) < time.Duration(h.MaxTimeBetweenMetrics)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I added a check for the lastMetricTime being non zero also, so the status is healthy before the first metrics are written.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm shouldn't we rather set the h.lastMetricTime to now in Connect as otherwise the output stays healthy if no metric arrives at all...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want the output to be healthy if no metric is received. This is because I want to avoid the case where telegraf is restarted (because of being unhealthy) on startup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the last change, lastMetricTime is set to now in Connect.

h.firstMetricReceived = true
h.lastMetricTime = time.Now()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about initializing the lastMetricTime in Connect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because Connect is guaranteed to be called before the first Write, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the initialization in NewHealth (which is called from init). If Connect is preferred, I can move it there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Connect would be better. The init function is called when importing this plugin so you might get timeouts if setting up Telegraf takes long. Connect is called after Init but before the first Write and is the better place I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest change, lastMetricTime is initialized as 0 and it is only updated in Write. Whe can add an initialization to now on Connect. This will help catch cases where telegraf starts but no metric is written before the max time elapses.
The current behaviour waits for the first metric to be written to start checking if the time between metrics is longer than expected. This was by design, to prevent the health output to fail during startup. However, if Connect is called 'close enough' to when everything should already be working, we can set lastMetricTime = now in connect also.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the last change, lastMetricTime is set to now in Connect.

@srebhan srebhan changed the title feat(health): Add max time between metrics check feat(outputs.health): Add max time between metrics check Mar 20, 2025
@telegraf-tiger telegraf-tiger bot added the plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins label Mar 20, 2025
@srebhan srebhan self-assigned this Mar 20, 2025
@tulku tulku force-pushed the feat/health_timeout_check branch from f83e744 to d5dc1a1 Compare March 27, 2025 16:57
@tulku
Copy link
Contributor Author

tulku commented Mar 27, 2025

Thanks for the review @srebhan ! The code is now simpler, removing the two boolean flags. I took the liberty to initialize lastMetricTime in NewHealth instead of Connect. Let me know if this is a problem.

@tulku tulku requested a review from srebhan March 27, 2025 17:02
@tulku tulku force-pushed the feat/health_timeout_check branch from d5dc1a1 to c42dcf0 Compare April 8, 2025 16:15
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tulku! One more comment... Please also check my responses to the unresolved previous comments...

Comment on lines 235 to 237
MaxTimeBetweenMetrics: 0,
healthy: true,
lastMetricTime: time.Time{},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to initialize the two fields as Golang always uses the nil value when creating new instances. So go with

Suggested change
MaxTimeBetweenMetrics: 0,
healthy: true,
lastMetricTime: time.Time{},
healthy: true,

@srebhan
Copy link
Member

srebhan commented Apr 9, 2025

@tulku please also run make lint on your machine and fix the linter issues...

@tulku tulku force-pushed the feat/health_timeout_check branch from c42dcf0 to cf1f78d Compare April 9, 2025 14:55
@tulku tulku requested a review from srebhan April 10, 2025 08:58
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tulku! Please set the initial h.lastMetricTime in Connect() to avoid cases where no metric arrives at all and the plugin is in healthy state.

@tulku tulku force-pushed the feat/health_timeout_check branch from cf1f78d to 38cfe92 Compare April 10, 2025 14:04
@tulku tulku requested a review from srebhan April 10, 2025 14:38
@tulku tulku force-pushed the feat/health_timeout_check branch from 38cfe92 to 8d25ea4 Compare April 10, 2025 14:39
@telegraf-tiger
Copy link
Contributor

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @tulku!

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Apr 14, 2025
@srebhan srebhan assigned skartikey and mstrandboge and unassigned srebhan Apr 14, 2025
Copy link
Contributor

@skartikey skartikey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tulku Thanks for your contribution!

@skartikey skartikey merged commit be9e5bf into influxdata:master Apr 14, 2025
27 checks passed
@github-actions github-actions bot added this to the v1.35.0 milestone Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add timeout check to the health output plugin
4 participants