Skip to content

Conversation

sthaha
Copy link
Collaborator

@sthaha sthaha commented Jun 3, 2025

This PR fixes a critical power attribution bug where workloads (processes/containers/VMs)
were being attributed the full node power consumption instead of only the power
corresponding to their actual CPU utilization.

Key Changes:

  • Introduced used vs idle power split: Node power is now divided between "used" (active workload) and "idle" (system overhead) portions based on actual CPU usage ratio from /proc/stat
  • Fixed attribution algorithm: Processes/containers now receive power attribution only from the "used" power portion rather than total node power
  • Enhanced metrics granularity: Added new kepler_node_cpu_[active|idle]_[watts|joules_total] kepler_node_cpu_usage_ratio metric

Impact: Provides significantly more accurate power attribution that correctly reflects the relationship between CPU utilization and energy consumption, eliminating over-attribution to workloads.

@github-actions github-actions bot added the fix A bug fix label Jun 3, 2025
@sthaha sthaha requested a review from vimalk78 June 4, 2025 06:25
@sthaha sthaha force-pushed the fix-power-attribution branch 2 times, most recently from 32eedaa to 1c74d29 Compare June 4, 2025 07:45
@sthaha sthaha force-pushed the fix-power-attribution branch from 1c74d29 to cf402c6 Compare June 5, 2025 06:56
Copy link

codecov bot commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 96.34146% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.65%. Comparing base (5a7e093) to head (913a0be).
Report is 3 commits behind head on reboot.

Files with missing lines Patch % Lines
internal/resource/procfs_reader.go 81.25% 4 Missing and 2 partials ⚠️
internal/resource/informer.go 72.72% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           reboot    #2125      +/-   ##
==========================================
+ Coverage   92.46%   92.65%   +0.19%     
==========================================
  Files          33       33              
  Lines        2587     2779     +192     
==========================================
+ Hits         2392     2575     +183     
- Misses        154      160       +6     
- Partials       41       44       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sthaha sthaha force-pushed the fix-power-attribution branch 2 times, most recently from 962c3c2 to f461dfa Compare June 5, 2025 08:24
@sthaha sthaha changed the title WIP: fix power attribution fix(power): correct power attribution by separating used vs idle power Jun 5, 2025
@sthaha sthaha marked this pull request as ready for review June 5, 2025 08:26
@sthaha sthaha force-pushed the fix-power-attribution branch 4 times, most recently from 38d74ab to 2b1b99b Compare June 5, 2025 13:00
@github-actions github-actions bot added the chore Routine tasks or maintenance label Jun 5, 2025
@sthaha sthaha force-pushed the fix-power-attribution branch from 2b1b99b to c1d1b52 Compare June 5, 2025 13:09
sthaha added 2 commits June 5, 2025 09:46
…ibution

Previously, workloads were attributed power from total node consumption,
causing over-attribution during low CPU usage periods. This commit implements
proportional power attribution using CPU utilization ratios.

Changes:
- Add NodeUsage type with active/idle power breakdown
- Calculate CPU usage ratio from /proc/stat
- Attribute only active power to workloads based on CPU time
- Export separate active/idle power metrics

Fixes power attribution accuracy by ensuring workloads are only attributed
power proportional to their actual CPU utilization.

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
Signed-off-by: Sunil Thaha <sthaha@redhat.com>
@sthaha sthaha force-pushed the fix-power-attribution branch from c1d1b52 to 913a0be Compare June 5, 2025 13:46
Comment on lines +105 to +106
// active over total, where active = total - idle
// and total = user + nice + system + idle + iowait + irq + softirq + steal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment to say active = total - (idle + iowait)

@@ -14,32 +14,37 @@ import (
"k8s.io/utils/clock"
)

type Node struct {
CPUTimeDelta float64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this field contains sum(proc.CPUTimeDelta) , not related to node

Comment on lines +222 to +236
ch <- prometheus.MustNewConstMetric(
c.nodeCPUActiveJoulesDesc,
prometheus.CounterValue,
energy.ActiveEnergy.Joules(),
zoneName, path,
)

ch <- prometheus.MustNewConstMetric(
c.nodeCPUIdleJoulesDesc,
prometheus.CounterValue,
energy.IdleEnergy.Joules(),
zoneName, path,
)

// watts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the node active/idle joules goes up and down, it is more like a gauge than a counter
image

Copy link
Collaborator Author

@sthaha sthaha Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you plotting? rate ?

You are right ... I am not incrementing (prev + current) the counter and only calculating the irate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, not with rate

@sthaha
Copy link
Collaborator Author

sthaha commented Jun 7, 2025

@vimalk78 I am merging this so that pod can be rebased on this. I will continue to fix the incorrect implementation of ActiveEnergy counter as a follow up commit.

@sthaha sthaha merged commit 11f34cd into sustainable-computing-io:reboot Jun 7, 2025
16 checks passed
@vprashar2929
Copy link
Collaborator

also:
Screenshot 2025-06-07 at 5 05 44 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Routine tasks or maintenance fix A bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants