feat(monitor): track terminated processes #2156
Conversation
Force-pushed from b6a356a to 73ecd3b
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##           reboot    #2156      +/-   ##
==========================================
- Coverage   91.40%   91.39%   -0.01%
==========================================
  Files          36       36
  Lines        3491     3546      +55
==========================================
+ Hits         3191     3241      +50
- Misses        232      237       +5
  Partials       68       68
==========================================
```

☔ View full report in Codecov by Sentry.
Force-pushed from 73ecd3b to 9c154af
Force-pushed from 611731d to 218da7b
Force-pushed from 218da7b to 643ac78
```diff
@@ -272,7 +274,7 @@ func (c *PowerCollector) collectNodeMetrics(ch chan<- prometheus.Metric, node *m
 	}

 // collectProcessMetrics collects process-level power metrics
-func (c *PowerCollector) collectProcessMetrics(ch chan<- prometheus.Metric, processes monitor.Processes) {
+func (c *PowerCollector) collectProcessMetrics(ch chan<- prometheus.Metric, state string, processes monitor.Processes) {
 	if len(processes) == 0 {
 		c.logger.Debug("No processes to export metrics for")
```
Change the log to also print the state.
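For orientation, here is a minimal, self-contained sketch (not the PR's actual code) of how the new `state` argument could flow into both the debug log and an exported label. The `Process`/`Processes` stubs, the `processWattsDesc` descriptor, and the label set are assumptions made only for this example.

```go
package collector

import (
	"log/slog"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// Process and Processes are stand-ins for the monitor package types;
// the fields and the pid map key are assumptions for this sketch.
type Process struct {
	Comm  string
	Watts float64
}

type Processes map[int]*Process

type PowerCollector struct {
	logger *slog.Logger
	// processWattsDesc is a hypothetical descriptor with pid, comm and state labels.
	processWattsDesc *prometheus.Desc
}

// collectProcessMetrics shows one way the state argument could be used:
// it is included in the debug log (as requested in review) and attached
// as a label so running and terminated processes yield distinct series.
func (c *PowerCollector) collectProcessMetrics(ch chan<- prometheus.Metric, state string, processes Processes) {
	if len(processes) == 0 {
		c.logger.Debug("No processes to export metrics for", "state", state)
		return
	}
	for pid, p := range processes {
		ch <- prometheus.MustNewConstMetric(
			c.processWattsDesc,
			prometheus.GaugeValue,
			p.Watts,
			strconv.Itoa(pid), p.Comm, state,
		)
	}
}
```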
```go
if !pm.exported.Load() {
	// NOTE: no need to deep clone since already terminated processes won't be updated
	maps.Copy(newSnapshot.TerminatedProcesses, prev.TerminatedProcesses)
}
```
If Snapshot() is not called, the terminated processes could start accumulating here. Can we keep a maximum limit?
Yes, I thought of the same, but rather than maxRetention I think it is better to implement a PQ with a fixed size (in a different PR)?
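A fixed-size priority queue as floated above could look roughly like the following sketch (illustrative only, not part of this PR). Here the priority is the time a process was seen as terminated, so the oldest entries are evicted first; prioritizing by energy consumed would be an equally valid choice. All type and field names are hypothetical.

```go
package monitor

import (
	"container/heap"
	"time"
)

// terminatedEntry pairs a terminated process ID with the time it was
// observed as terminated; the fields are hypothetical.
type terminatedEntry struct {
	pid      int
	exitedAt time.Time
}

// terminatedHeap is a min-heap ordered by exitedAt: the oldest entry is
// at the root and is the first candidate for eviction.
type terminatedHeap []terminatedEntry

func (h terminatedHeap) Len() int           { return len(h) }
func (h terminatedHeap) Less(i, j int) bool { return h[i].exitedAt.Before(h[j].exitedAt) }
func (h terminatedHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *terminatedHeap) Push(x any) { *h = append(*h, x.(terminatedEntry)) }

func (h *terminatedHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// boundedTerminated retains at most maxSize terminated entries between
// calls to Snapshot(), evicting the oldest when the limit is exceeded.
type boundedTerminated struct {
	maxSize int
	entries terminatedHeap
}

func (b *boundedTerminated) Add(e terminatedEntry) {
	heap.Push(&b.entries, e)
	if b.entries.Len() > b.maxSize {
		heap.Pop(&b.entries) // drop the oldest terminated process
	}
}
```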
Previously, between each call to `Snapshot()` (usually from Prometheus exporter), processes that terminate go unaccounted for in power consumption metrics. This creates a data loss problem where workloads that consumed significant power but are no longer running will not be returned to Prometheus.

**Scenario Example:**

- Prometheus scrapes every 30 seconds
- Kepler monitor computes power every 5 seconds (default interval)
- Between first and second Prometheus scrape, monitor computes 5-6 times
- Each computation overwrites the snapshot with only currently running processes
- Terminated processes that consumed power during those 25-30 seconds are lost

Changes include:

- Add terminated process tracking to power monitor
- Extend power collector to export terminated process metrics
- Include comprehensive test coverage for terminated process scenarios
- Enhance process lifecycle management for accurate power attribution

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
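To make the carry-forward described above concrete, below is a rough, self-contained sketch of how terminated processes can accumulate in the snapshot until the exporter has read them. The `Snapshot`/`PowerMonitor` field names, the pid-keyed maps, and the `carryTerminated` helper are all hypothetical and only loosely based on the hunk shown earlier.

```go
package monitor

import (
	"maps"
	"sync/atomic"
)

// Process, Snapshot and PowerMonitor are simplified stand-ins; the real
// types in the PR carry more fields.
type Process struct {
	Comm  string
	Watts float64
}

type Snapshot struct {
	Processes           map[int]*Process // currently running
	TerminatedProcesses map[int]*Process // terminated since last export
}

type PowerMonitor struct {
	exported atomic.Bool // set once a snapshot has been served to the exporter
}

// carryTerminated moves processes that disappeared since the previous
// computation into TerminatedProcesses, and keeps earlier terminated
// processes around until they have been exported at least once.
func (pm *PowerMonitor) carryTerminated(prev, next *Snapshot) {
	for pid, proc := range prev.Processes {
		if _, running := next.Processes[pid]; !running {
			next.TerminatedProcesses[pid] = proc
		}
	}
	// Mirrors the hunk shown earlier: no deep clone is needed because
	// terminated processes are never updated again.
	if !pm.exported.Load() {
		maps.Copy(next.TerminatedProcesses, prev.TerminatedProcesses)
	}
}
```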
Force-pushed from 643ac78 to 5e91e0f
Merged commit f717b73 into sustainable-computing-io:reboot