
Conversation

sthaha
Collaborator

@sthaha sthaha commented Jun 16, 2025

Previously, processes that terminated between two calls to `Snapshot()` (usually made by the Prometheus exporter) went unaccounted for in the power consumption metrics. This is a data-loss problem: workloads that consumed significant power but are no longer running are never reported to Prometheus. (A minimal sketch of the carry-forward fix follows the change list below.)

Scenario Example:

  • Prometheus scrapes every 30 seconds
  • Kepler monitor computes power every 5 seconds (default interval)
  • Between first and second Prometheus scrape, monitor computes 5-6 times
  • Each computation overwrites the snapshot with only currently running processes
  • Terminated processes that consumed power during those 25-30 seconds are lost

Changes include:

  • Add terminated process tracking to power monitor
  • Extend power collector to export terminated process metrics
  • Include comprehensive test coverage for terminated process scenarios
  • Enhance process lifecycle management for accurate power attribution
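
Not Kepler's actual implementation, but a minimal sketch of the carry-forward idea described above; the type and method names here (Monitor, compute, Running) are simplified stand-ins, and only the exported flag and the terminated-process map mirror what this PR adds:

```go
package main

import (
	"fmt"
	"maps"
	"sync/atomic"
)

// Process is a simplified stand-in for the monitor's per-process power data.
type Process struct {
	PID    int
	Joules float64
}

// Snapshot is a simplified stand-in for the monitor's snapshot type.
type Snapshot struct {
	Running    map[int]*Process
	Terminated map[int]*Process
}

// Monitor sketches the carry-forward idea: terminated processes stay in the
// snapshot until at least one exporter has seen them via Snapshot().
type Monitor struct {
	snapshot *Snapshot
	exported atomic.Bool
}

// compute builds a new snapshot from the currently running processes plus any
// processes that disappeared since the previous snapshot.
func (m *Monitor) compute(running map[int]*Process) {
	next := &Snapshot{Running: running, Terminated: map[int]*Process{}}
	if prev := m.snapshot; prev != nil {
		// Any previously running process that is gone now has terminated.
		for pid, p := range prev.Running {
			if _, ok := running[pid]; !ok {
				next.Terminated[pid] = p
			}
		}
		// Keep earlier terminated processes until they have been exported once.
		if !m.exported.Load() {
			maps.Copy(next.Terminated, prev.Terminated)
		}
	}
	m.snapshot = next
	m.exported.Store(false)
}

// Snapshot hands the current data to the exporter and marks it exported, so a
// later compute() is free to drop the terminated entries.
func (m *Monitor) Snapshot() *Snapshot {
	m.exported.Store(true)
	return m.snapshot
}

func main() {
	m := &Monitor{}
	m.compute(map[int]*Process{1: {PID: 1, Joules: 12.5}, 2: {PID: 2, Joules: 3.2}})
	m.compute(map[int]*Process{2: {PID: 2, Joules: 4.1}}) // PID 1 terminated
	m.compute(map[int]*Process{2: {PID: 2, Joules: 5.0}}) // not exported yet, PID 1 kept
	fmt.Println("terminated:", len(m.Snapshot().Terminated)) // 1
}
```

With this shape, nothing bounds how many terminated entries accumulate while no scrape happens, which is the concern raised in the review discussion below.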

@github-actions github-actions bot added the `feat` (A new feature or enhancement) label Jun 16, 2025
@sthaha sthaha force-pushed the feat-track-terminated branch from b6a356a to 73ecd3b Compare June 17, 2025 22:06
@sthaha sthaha requested a review from vimalk78 June 17, 2025 22:07

codecov bot commented Jun 17, 2025

Codecov Report

Attention: Patch coverage is 88.31169% with 9 lines in your changes missing coverage. Please review.

Project coverage is 91.39%. Comparing base (d700f89) to head (5e91e0f).
Report is 2 commits behind head on reboot.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/monitor/process.go | 79.06% | 8 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           reboot    #2156      +/-   ##
==========================================
- Coverage   91.40%   91.39%   -0.01%     
==========================================
  Files          36       36              
  Lines        3491     3546      +55     
==========================================
+ Hits         3191     3241      +50     
- Misses        232      237       +5     
  Partials       68       68              

☔ View full report in Codecov by Sentry.


@sthaha sthaha force-pushed the feat-track-terminated branch from 73ecd3b to 9c154af Compare June 17, 2025 22:11
@sthaha sthaha requested a review from vprashar2929 June 17, 2025 22:11
@sthaha sthaha force-pushed the feat-track-terminated branch 3 times, most recently from 611731d to 218da7b Compare June 18, 2025 12:24
@sthaha sthaha marked this pull request as ready for review June 19, 2025 06:51
@sthaha sthaha force-pushed the feat-track-terminated branch from 218da7b to 643ac78 Compare June 19, 2025 08:39
@vimalk78 vimalk78 previously approved these changes Jun 19, 2025
@@ -272,7 +274,7 @@ func (c *PowerCollector) collectNodeMetrics(ch chan<- prometheus.Metric, node *m
}

// collectProcessMetrics collects process-level power metrics
-func (c *PowerCollector) collectProcessMetrics(ch chan<- prometheus.Metric, processes monitor.Processes) {
+func (c *PowerCollector) collectProcessMetrics(ch chan<- prometheus.Metric, state string, processes monitor.Processes) {
	if len(processes) == 0 {
		c.logger.Debug("No processes to export metrics for")
Collaborator

Change the log to also print the state.
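
Not from the PR itself, but a minimal self-contained sketch of the reviewer's suggestion, using log/slog in place of the collector's own logger and a simplified signature (collectProcessMetrics, state, and processes here are illustrative stand-ins):

```go
package main

import "log/slog"

// collectProcessMetrics is a simplified stand-in for the collector method;
// only the early-return logging path is sketched here.
func collectProcessMetrics(logger *slog.Logger, state string, processes map[int]struct{}) {
	if len(processes) == 0 {
		// Include the state so "running" and "terminated" collections can be
		// told apart in the debug output.
		logger.Debug("No processes to export metrics for", "state", state)
		return
	}
	// ... export per-process metrics labelled with state ...
}

func main() {
	collectProcessMetrics(slog.Default(), "terminated", nil)
}
```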

Comment on lines +142 to +145
	if !pm.exported.Load() {
		// NOTE: no need to deep clone since already terminated processes won't be updated
		maps.Copy(newSnapshot.TerminatedProcesses, prev.TerminatedProcesses)
	}
Collaborator

If Snapshot() is not called, terminated processes could keep accumulating here. Can we keep a maximum limit?

Collaborator Author

Yes, I thought of the same, but with maxRetention I think it is better to implement a fixed-size priority queue (in a different PR)?
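
Not part of this PR, but a minimal sketch of that fixed-size priority-queue idea using container/heap; the names (terminated, retain) and the eviction policy (drop the lowest-energy entry) are assumptions for illustration:

```go
package main

import (
	"container/heap"
	"fmt"
)

// terminated is a hypothetical record for a terminated process.
type terminated struct {
	pid    int
	joules float64
}

// minHeap orders terminated processes by energy, lowest first, so the
// cheapest entries are evicted when the retention limit is hit.
type minHeap []terminated

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].joules < h[j].joules }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(terminated)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// retain keeps at most maxRetention terminated processes, dropping the ones
// that consumed the least energy once the limit is exceeded.
func retain(h *minHeap, p terminated, maxRetention int) {
	heap.Push(h, p)
	if h.Len() > maxRetention {
		heap.Pop(h) // evict the lowest-energy entry
	}
}

func main() {
	h := &minHeap{}
	heap.Init(h)
	for _, p := range []terminated{{101, 4.2}, {102, 0.3}, {103, 9.7}, {104, 1.1}} {
		retain(h, p, 3)
	}
	fmt.Println("retained:", *h) // PID 102 (0.3 J) has been evicted
}
```

Evicting by lowest energy would keep the most power-hungry terminated workloads visible to the next scrape even under a hard cap.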

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
@vimalk78 vimalk78 merged commit f717b73 into sustainable-computing-io:reboot Jun 19, 2025
17 of 18 checks passed