Tune dragonfly to Remove Old Entries & Use Longtext to fit larger warming jobs #4145

e-ngo · 2025-06-19T19:08:21Z

use longtext to fit larger clusters
fix scheduler and seed-client register logic

Description

This change updates the manager's peer and scheduler registration logic to use more stable fields, i.e., host names instead of IPs, which can change when a pod is rescheduled.

It also uses longtext instead of text so that warming jobs don't fail in large clusters.

Related Issue

Maybe dragonflyoss/client#1116

Motivation and Context

Pod IPs change whenever pods are rescheduled to other nodes. When the manager returns the list of schedulers, it reads the entries in order, so it will ALWAYS return a stale scheduler / seed client address. Scheduler and seed client re-registration SHOULD override the OLD entries.
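To illustrate the intended behavior, here is a minimal in-memory sketch, not the manager's actual code. The `Instance` and `Registry` types and the sample host name and IPs are hypothetical; the only point is that keying on the host name lets a re-registration overwrite the old row instead of adding a second one.

```go
package main

import "fmt"

// Instance is a hypothetical stand-in for a scheduler or seed client row.
type Instance struct {
	Hostname string
	IP       string
}

// Registry keys rows by host name, which stays stable across rescheduling,
// and remembers insertion order to mimic a table that is read back in order.
type Registry struct {
	order []string
	rows  map[string]Instance
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{rows: make(map[string]Instance)}
}

// Register inserts a new row, or overwrites the existing row that has the
// same host name, so a changed IP replaces the stale one.
func (r *Registry) Register(in Instance) {
	if _, ok := r.rows[in.Hostname]; !ok {
		r.order = append(r.order, in.Hostname)
	}
	r.rows[in.Hostname] = in
}

// List returns the rows in insertion order.
func (r *Registry) List() []Instance {
	out := make([]Instance, 0, len(r.order))
	for _, h := range r.order {
		out = append(out, r.rows[h])
	}
	return out
}

func main() {
	r := NewRegistry()
	r.Register(Instance{Hostname: "scheduler-0", IP: "10.0.0.5"})
	// The pod is rescheduled and comes back with a new IP.
	r.Register(Instance{Hostname: "scheduler-0", IP: "10.0.1.9"})
	fmt.Println(r.List())
}
```

If the rows were keyed by IP instead, the second call would add a new row and the first, stale row would still be listed ahead of it.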

Additionally, jobs may be large, with many pieces, and the resulting data won't fit in a text column. Store it in longtext instead.
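For a rough sense of scale, assuming a MySQL backend: a TEXT column holds at most 65,535 bytes, while LONGTEXT holds up to 4,294,967,295 bytes. The sketch below JSON-encodes a made-up per-host result list and checks it against both limits. The `hostResult` shape and the host counts are hypothetical and are not the real job payload.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MySQL column limits in bytes.
const (
	maxTextBytes     int64 = 1<<16 - 1 // TEXT
	maxLongTextBytes int64 = 1<<32 - 1 // LONGTEXT
)

// hostResult is a hypothetical per-host entry in a job result.
type hostResult struct {
	Hostname string `json:"hostname"`
	State    string `json:"state"`
}

// encodedSize returns the JSON-encoded size in bytes of a result list
// covering the given number of hosts.
func encodedSize(hosts int) (int64, error) {
	results := make([]hostResult, hosts)
	for i := range results {
		results[i] = hostResult{Hostname: fmt.Sprintf("node-%05d", i), State: "SUCCESS"}
	}
	b, err := json.Marshal(results)
	return int64(len(b)), err
}

// fits reports whether a payload of n bytes fits under the given column limit.
func fits(n, limit int64) bool {
	return n <= limit
}

func main() {
	for _, hosts := range []int{100, 5000} {
		n, err := encodedSize(hosts)
		if err != nil {
			panic(err)
		}
		fmt.Printf("hosts=%d bytes=%d text=%v longtext=%v\n",
			hosts, n, fits(n, maxTextBytes), fits(n, maxLongTextBytes))
	}
}
```

Even with these small entries, the payload outgrows TEXT once the cluster reaches a few thousand hosts.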

Screenshots (if appropriate)

Types of changes

[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Documentation Update (if none of the other choices apply)

Checklist

My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.

codecov · 2025-06-20T02:41:08Z

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 32.96%. Comparing base (ef9c9ef) to head (c5a3edb).
Report is 26 commits behind head on main.

Files with missing lines	Patch %	Lines
manager/models/models.go	0.00%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4145   +/-   ##
=======================================
  Coverage   32.96%   32.96%           
=======================================
  Files         346      346           
  Lines       40922    40918    -4     
=======================================
  Hits        13490    13490           
+ Misses      26538    26534    -4     
  Partials      894      894

Flag	Coverage Δ
unittests	`32.96% <0.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
manager/rpcserver/manager_server_v2.go	`0.00% <ø> (ø)`
manager/models/models.go	`27.39% <0.00%> (ø)`

gaius-qi · 2025-06-20T02:52:46Z

@e-ngo I think the change to longtext is good.

After the scheduler starts, it maintains a keepalive connection with the manager. If the scheduler goes down and the keepalive connection is disconnected, the scheduler instance's status in the database becomes inactive.

However, for problem A, when the manager is down, a restarting scheduler cannot notify the manager that it went down, so the scheduler instance's status in the database cannot be updated to inactive.

I think the solution to this problem is to scan the scheduler and seed client tables in the database regularly, and automatically delete a row if its scheduler or seed client has not reported for more than 2 keepalive intervals.
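A minimal sketch of that cleanup rule, using an in-memory map in place of the database tables. The `instance` type, the `pruneStale` helper, the 5-second interval, and the sample timestamps are all hypothetical; a real implementation would run periodically and issue a delete against the scheduler and seed client tables.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// instance is a hypothetical row with the time of its last keepalive report.
type instance struct {
	Hostname      string
	LastKeepalive time.Time
}

// pruneStale deletes every row whose last keepalive is older than two
// keepalive intervals before now and returns the removed host names, sorted.
func pruneStale(rows map[string]instance, now time.Time, interval time.Duration) []string {
	cutoff := now.Add(-2 * interval)
	var removed []string
	for host, row := range rows {
		if row.LastKeepalive.Before(cutoff) {
			delete(rows, host) // deleting during range is safe in Go
			removed = append(removed, host)
		}
	}
	sort.Strings(removed)
	return removed
}

// demo runs one scan over two sample rows and returns the removed host
// names together with the number of rows that remain.
func demo() ([]string, int) {
	now := time.Date(2025, 6, 20, 12, 0, 0, 0, time.UTC)
	interval := 5 * time.Second // assumed keepalive interval
	rows := map[string]instance{
		"scheduler-old": {Hostname: "scheduler-old", LastKeepalive: now.Add(-30 * time.Second)},
		"scheduler-new": {Hostname: "scheduler-new", LastKeepalive: now.Add(-3 * time.Second)},
	}
	removed := pruneStale(rows, now, interval)
	return removed, len(rows)
}

func main() {
	removed, remaining := demo()
	fmt.Println("removed:", removed, "remaining:", remaining)
}
```

Because the rule depends only on the last report time, it also covers the case where the manager was down and never saw the disconnect.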

e-ngo · 2025-06-20T06:43:57Z

Yeah, that makes sense. I can make that change.

yxxhero · 2025-06-21T03:51:46Z

@e-ngo just do it

chlins · 2025-07-03T11:01:36Z

@e-ngo Hi, I think the issue @gaius-qi mentioned will be resolved by #4179, so this PR can focus on the DB column type change. Could you update your PR to retain only the "longtext" change? Then we can merge it.

gaius-qi

LGTM

chlins · 2025-07-03T12:46:00Z

@e-ngo We will merge it and then restore the change for removing old entries.

chlins

lgtm

e-ngo added 2 commits June 19, 2025 11:58

Tune dragonfly to fit our usecase by:

* use longtext to fit larger clusters
* fix scheduler and seed-client register logic

c483491

Remove commented out lines.

c5a3edb

e-ngo requested a review from a team as a code owner June 19, 2025 19:08

e-ngo requested review from liubin, gaius-qi and cndoit18 June 19, 2025 19:08

github-actions bot assigned chlins, hyy0322 and xujihui1985 Jun 19, 2025

gaius-qi added the enhancement label Jun 20, 2025

gaius-qi added this to the v2.3.0 milestone Jun 20, 2025

gaius-qi modified the milestones: v2.3.0, v2.4.0 Jul 1, 2025

gaius-qi approved these changes Jul 3, 2025

chlins approved these changes Jul 3, 2025

chlins merged commit b66c49f into dragonflyoss:main Jul 3, 2025
18 of 19 checks passed
