Skip to content

Conversation

e-ngo
Copy link
Contributor

@e-ngo e-ngo commented Jun 19, 2025

  • use longtext to fit larger clusters

  • fix scheduler and seed-client register logic

Description

This change updates the manager peer & scheduler registration logic to use more stable fields, ie. host names as opposed to IPs which can change when a pod is rescheduled.

Also use longtext instead of just text so that warming jobs don't fail in large clusters.

Related Issue

Maybe dragonflyoss/client#1116

Motivation and Context

Pod IPs will change. Pods will be rescheduled to other nodes, etc. When manager returns list of schedulers, it grabs the list in order. Thus it will ALWAYS return a stale scheduler / seed client address. The scheduler/seed reregistration SHOULD over rid the OLD entries.

Additionally jobs may be larger in pieces. This won't fit in text. Fit it in longtext instead.

Screenshots (if appropriate)

Types of changes

  • [ x] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation Update (if none of the other choices apply)

Checklist

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

e-ngo added 2 commits June 19, 2025 11:58
* use longtext to fit larger clusters

* fix scheduler and seed-client register logic
@e-ngo e-ngo requested a review from a team as a code owner June 19, 2025 19:08
@e-ngo e-ngo requested review from liubin, gaius-qi and cndoit18 June 19, 2025 19:08
Copy link

codecov bot commented Jun 20, 2025

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 32.96%. Comparing base (ef9c9ef) to head (c5a3edb).
Report is 26 commits behind head on main.

Files with missing lines Patch % Lines
manager/models/models.go 0.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #4145   +/-   ##
=======================================
  Coverage   32.96%   32.96%           
=======================================
  Files         346      346           
  Lines       40922    40918    -4     
=======================================
  Hits        13490    13490           
+ Misses      26538    26534    -4     
  Partials      894      894           
Flag Coverage Δ
unittests 32.96% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
manager/rpcserver/manager_server_v2.go 0.00% <ø> (ø)
manager/models/models.go 27.39% <0.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gaius-qi
Copy link
Member

@e-ngo I think the change to longtext is good.

After the scheduler is started, it will maintain keepalive with the manager. If the scheduler down and the keepalive connection with the manager is disconnected, the scheduler instance status in the database will become inactive.

However, for problem A, when the manager down, the scheduler restart cannot notify the manager that it has down. So unable to update the status of scheduler instance in the database to inactive.

I think the solution to this problem is to scan the scheduler and seed client tables regularly in database, and automatically delete the row in the database if the scheduler and seed client do not report for more than 2 keepalive intervals.

@gaius-qi gaius-qi added the enhancement New feature or request label Jun 20, 2025
@gaius-qi gaius-qi added this to the v2.3.0 milestone Jun 20, 2025
@e-ngo
Copy link
Contributor Author

e-ngo commented Jun 20, 2025

Yea, that makes sense. Can make that change

@yxxhero
Copy link
Member

yxxhero commented Jun 21, 2025

@e-ngo just do it

@gaius-qi gaius-qi modified the milestones: v2.3.0, v2.4.0 Jul 1, 2025
@chlins
Copy link
Member

chlins commented Jul 3, 2025

@e-ngo I think the change to longtext is good.

After the scheduler is started, it will maintain keepalive with the manager. If the scheduler down and the keepalive connection with the manager is disconnected, the scheduler instance status in the database will become inactive.

However, for problem A, when the manager down, the scheduler restart cannot notify the manager that it has down. So unable to update the status of scheduler instance in the database to inactive.

I think the solution to this problem is to scan the scheduler and seed client tables regularly in database, and automatically delete the row in the database if the scheduler and seed client do not report for more than 2 keepalive intervals.

@e-ngo Hi, I think the issue @gaius-qi mentioned will be resolved by #4179, so this PR can be focused on the db column type change, so could you update your PR to only retain the "longtext" change, then we can merge it.

Copy link
Member

@gaius-qi gaius-qi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chlins
Copy link
Member

chlins commented Jul 3, 2025

@e-ngo We will merge it and then restore the change for removing old entries.

Copy link
Member

@chlins chlins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@chlins chlins merged commit b66c49f into dragonflyoss:main Jul 3, 2025
18 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants