
Conversation

sthaha
Collaborator

@sthaha sthaha commented Jun 16, 2025

Previously, processes that terminated between two calls to `Snapshot()` (usually made by the Prometheus exporter) went unaccounted for in the power consumption metrics. This is a data-loss problem: workloads that consumed significant power but are no longer running are never reported to Prometheus. (A minimal sketch of the carry-forward fix follows the change list below.)

Scenario Example:

  • Prometheus scrapes every 30 seconds
  • Kepler monitor computes power every 5 seconds (default interval)
  • Between first and second Prometheus scrape, monitor computes 5-6 times
  • Each computation overwrites the snapshot with only currently running processes
  • Terminated processes that consumed power during those 25-30 seconds are lost

Changes include:

  • Add terminated process tracking to power monitor
  • Extend power collector to export terminated process metrics
  • Include comprehensive test coverage for terminated process scenarios
  • Enhance process lifecycle management for accurate power attribution
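
Not Kepler's actual implementation, but a minimal sketch of the carry-forward idea described above; the type and method names here (Monitor, compute, Running) are simplified stand-ins, and only the exported flag and the terminated-process map mirror what this PR adds:

```go
package main

import (
	"fmt"
	"maps"
	"sync/atomic"
)

// Process is a simplified stand-in for the monitor's per-process power data.
type Process struct {
	PID    int
	Joules float64
}

// Snapshot is a simplified stand-in for the monitor's snapshot type.
type Snapshot struct {
	Running    map[int]*Process
	Terminated map[int]*Process
}

// Monitor sketches the carry-forward idea: terminated processes stay in the
// snapshot until at least one exporter has seen them via Snapshot().
type Monitor struct {
	snapshot *Snapshot
	exported atomic.Bool
}

// compute builds a new snapshot from the currently running processes plus any
// processes that disappeared since the previous snapshot.
func (m *Monitor) compute(running map[int]*Process) {
	next := &Snapshot{Running: running, Terminated: map[int]*Process{}}
	if prev := m.snapshot; prev != nil {
		// Any previously running process that is gone now has terminated.
		for pid, p := range prev.Running {
			if _, ok := running[pid]; !ok {
				next.Terminated[pid] = p
			}
		}
		// Keep earlier terminated processes until they have been exported once.
		if !m.exported.Load() {
			maps.Copy(next.Terminated, prev.Terminated)
		}
	}
	m.snapshot = next
	m.exported.Store(false)
}

// Snapshot hands the current data to the exporter and marks it exported, so a
// later compute() is free to drop the terminated entries.
func (m *Monitor) Snapshot() *Snapshot {
	m.exported.Store(true)
	return m.snapshot
}

func main() {
	m := &Monitor{}
	m.compute(map[int]*Process{1: {PID: 1, Joules: 12.5}, 2: {PID: 2, Joules: 3.2}})
	m.compute(map[int]*Process{2: {PID: 2, Joules: 4.1}}) // PID 1 terminated
	m.compute(map[int]*Process{2: {PID: 2, Joules: 5.0}}) // not exported yet, PID 1 kept
	fmt.Println("terminated:", len(m.Snapshot().Terminated)) // 1
}
```

With this shape, nothing bounds how many terminated entries accumulate while no scrape happens, which is the concern raised in the review discussion below.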

@github-actions github-actions bot added the `feat` (A new feature or enhancement) label Jun 16, 2025
@sthaha sthaha force-pushed the feat-track-terminated branch from b6a356a to 73ecd3b Compare June 17, 2025 22:06
@sthaha sthaha requested a review from vimalk78 June 17, 2025 22:07

codecov bot commented Jun 17, 2025

Codecov Report

Attention: Patch coverage is 88.31169% with 9 lines in your changes missing coverage. Please review.

Project coverage is 91.39%. Comparing base (d700f89) to head (5e91e0f).
Report is 2 commits behind head on reboot.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/monitor/process.go | 79.06% | 8 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           reboot    #2156      +/-   ##
==========================================
- Coverage   91.40%   91.39%   -0.01%     
==========================================
  Files          36       36              
  Lines        3491     3546      +55     
==========================================
+ Hits         3191     3241      +50     
- Misses        232      237       +5     
  Partials       68       68              

☔ View full report in Codecov by Sentry.


@sthaha sthaha force-pushed the feat-track-terminated branch from 73ecd3b to 9c154af Compare June 17, 2025 22:11
@sthaha sthaha requested a review from vprashar2929 June 17, 2025 22:11
@sthaha sthaha force-pushed the feat-track-terminated branch 3 times, most recently from 611731d to 218da7b Compare June 18, 2025 12:24
@sthaha sthaha marked this pull request as ready for review June 19, 2025 06:51
@sthaha sthaha force-pushed the feat-track-terminated branch from 218da7b to 643ac78 Compare June 19, 2025 08:39
@vimalk78 vimalk78 previously approved these changes Jun 19, 2025
@@ -272,7 +274,7 @@ func (c *PowerCollector) collectNodeMetrics(ch chan<- prometheus.Metric, node *m
}

// collectProcessMetrics collects process-level power metrics
-func (c *PowerCollector) collectProcessMetrics(ch chan<- prometheus.Metric, processes monitor.Processes) {
+func (c *PowerCollector) collectProcessMetrics(ch chan<- prometheus.Metric, state string, processes monitor.Processes) {
	if len(processes) == 0 {
		c.logger.Debug("No processes to export metrics for")
Collaborator

Change the log to also print the state.
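
Not from the PR itself, but a minimal self-contained sketch of the reviewer's suggestion, using log/slog in place of the collector's own logger and a simplified signature (collectProcessMetrics, state, and processes here are illustrative stand-ins):

```go
package main

import "log/slog"

// collectProcessMetrics is a simplified stand-in for the collector method;
// only the early-return logging path is sketched here.
func collectProcessMetrics(logger *slog.Logger, state string, processes map[int]struct{}) {
	if len(processes) == 0 {
		// Include the state so "running" and "terminated" collections can be
		// told apart in the debug output.
		logger.Debug("No processes to export metrics for", "state", state)
		return
	}
	// ... export per-process metrics labelled with state ...
}

func main() {
	collectProcessMetrics(slog.Default(), "terminated", nil)
}
```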

Comment on lines +142 to +145
	if !pm.exported.Load() {
		// NOTE: no need to deep clone since already terminated processes won't be updated
		maps.Copy(newSnapshot.TerminatedProcesses, prev.TerminatedProcesses)
	}
Collaborator

If Snapshot() is not called, terminated processes could keep accumulating here. Can we keep a maximum limit?

Collaborator Author

Yes, I thought of the same, but with maxRetention I think it is better to implement a fixed-size priority queue (in a different PR)?
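
Not part of this PR, but a minimal sketch of that fixed-size priority-queue idea using container/heap; the names (terminated, retain) and the eviction policy (drop the lowest-energy entry) are assumptions for illustration:

```go
package main

import (
	"container/heap"
	"fmt"
)

// terminated is a hypothetical record for a terminated process.
type terminated struct {
	pid    int
	joules float64
}

// minHeap orders terminated processes by energy, lowest first, so the
// cheapest entries are evicted when the retention limit is hit.
type minHeap []terminated

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].joules < h[j].joules }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(terminated)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// retain keeps at most maxRetention terminated processes, dropping the ones
// that consumed the least energy once the limit is exceeded.
func retain(h *minHeap, p terminated, maxRetention int) {
	heap.Push(h, p)
	if h.Len() > maxRetention {
		heap.Pop(h) // evict the lowest-energy entry
	}
}

func main() {
	h := &minHeap{}
	heap.Init(h)
	for _, p := range []terminated{{101, 4.2}, {102, 0.3}, {103, 9.7}, {104, 1.1}} {
		retain(h, p, 3)
	}
	fmt.Println("retained:", *h) // PID 102 (0.3 J) has been evicted
}
```

Evicting by lowest energy would keep the most power-hungry terminated workloads visible to the next scrape even under a hard cap.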

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
@vimalk78 vimalk78 merged commit f717b73 into sustainable-computing-io:reboot Jun 19, 2025
17 of 18 checks passed