Skip to content

Conversation

lhy1024
Copy link
Contributor

@lhy1024 lhy1024 commented Oct 28, 2024

What problem does this PR solve?

Issue Number: Close #8748, close #8426

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test

Release note

None.

Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 28, 2024
Copy link

codecov bot commented Oct 28, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 76.22%. Comparing base (c5812ee) to head (dd79eb6).
Report is 150 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8749      +/-   ##
==========================================
- Coverage   76.26%   76.22%   -0.04%     
==========================================
  Files         461      462       +1     
  Lines       70392    70429      +37     
==========================================
+ Hits        53682    53687       +5     
- Misses      13356    13383      +27     
- Partials     3354     3359       +5     
Flag Coverage Δ
unittests 76.22% <93.75%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 81 to 82
regionLoadedFromDefault bool
regionLoadedFromStorage bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to distinguish these two kinds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to avoid problems after ps.useRegionStorage switch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using string?

regionLoadedFrom: "", "etcd", "leveldb"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, we just have these two ways now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer just using a single field to store this info like:

type regionSource int

const (
	unloaded regionSource = iota
	etcd
	leveldb
)

Signed-off-by: lhy1024 <admin@liudos.us>
@@ -39,11 +40,13 @@ func newStatusHandler(svr *server.Server, rd *render.Render) *statusHandler {
// @Success 200 {object} versioninfo.Status
// @Router /status [get]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether it is suitable. ref #8748 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about /member or /health?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/members return all member info according etcd.ListEtcdMembers
/health is to query the status of each node through the /ping interface to return information about all the nodes, if we don't add a new api, we won't be able to get the result of whether the load region is complete or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status is more like some fixed information to me, at least for the current definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put it under /status, which is appropriate from the name, but currently status returns compiled information.

Yes, as I said in the issue.

Version string `json:"version"`
GitHash string `json:"git_hash"`
StartTimestamp int64 `json:"start_timestamp"`
AreRegionsLoaded bool `json:"are_regions_loaded"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to use loaded

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loaded bool json:"loaded"

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
@@ -31,6 +31,7 @@ type Status struct {
Version string `json:"version"`
GitHash string `json:"git_hash"`
StartTimestamp int64 `json:"start_timestamp"`
Loaded bool `json:"loaded"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer RegionLoaded

Comment on lines 81 to 82
regionLoadedFromDefault bool
regionLoadedFromStorage bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using string?

regionLoadedFrom: "", "etcd", "leveldb"

Comment on lines 140 to 154
if atomic.LoadInt32(&ps.useRegionStorage) == 0 {
return ps.Storage.LoadRegions(ctx, f)
err := ps.Storage.LoadRegions(ctx, f)
if err == nil {
ps.regionLoadedFromDefault = true
}
return err
}

ps.mu.Lock()
defer ps.mu.Unlock()
if !ps.regionLoaded {
if !ps.regionLoadedFromStorage {
if err := ps.regionStorage.LoadRegions(ctx, f); err != nil {
return err
}
ps.regionLoaded = true
ps.regionLoadedFromStorage = true
}
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can use *coreStorage.LoadRegions() directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't check ps.regionLoadedFromStorage

Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
useRegionStorage int32
regionLoaded bool
mu syncutil.Mutex
useRegionStorage int32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that safe to switch between using or not using RegionStore in the master branch or the latest releases now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultUseRegionStorage is true and only check switch when it was just elected leader.

Signed-off-by: lhy1024 <admin@liudos.us>
useRegionStorage int32
regionLoaded bool
mu syncutil.Mutex
useRegionStorageFlag int32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use bool?

Signed-off-by: lhy1024 <admin@liudos.us>
@lhy1024
Copy link
Contributor Author

lhy1024 commented Nov 18, 2024

/test pull-integration-realcluster-test

@@ -44,6 +45,7 @@ func (h *statusHandler) GetPDStatus(w http.ResponseWriter, _ *http.Request) {
GitHash: versioninfo.PDGitHash,
Version: versioninfo.PDReleaseVersion,
StartTimestamp: h.svr.StartTimestamp(),
RegionLoaded: storage.AreRegionsLoaded(h.svr.GetStorage()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a more general field here like "Ready" which we can extend in the future if we find more conditions under which PD can't serve?

Does PDStatus API exposed by all components of disaggregated PD?

Any chance we introduce a new API /ready which we can use across all TiDB components in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Tema, thanks for your advice
Q1: A general field will be helpful to expand different scenarios, but we found that currently there is only RegionLoaded is needed. If there are other situations in the future, we tend to add new fields with more accurate meanings.
Q2: Standalone PD or Scheduling Service will expose this API(especially the RegionLoaded field), others such as TSO Service don't.
Q3: IMO, it's good we can use the same interface to query service status, but the fields may be different. The benefits of unification may not be obvious, but it is indeed a better way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally you want to decouple data plane (pd) from control plane (tidb-operator, tiUp), so that if you make changes to one, you don't have to upgrade other and it can start leveraging any improvements right away. That is why usually the common convention is to use /ready across all components. This way the control-plane can have a common logic for rolling restart of all components, without need to necessarily overcustomize each of them. This is not just about PD microservices but many other components of TiDB cluster (tidb, dm, cdc, tikv). Some of them still require additional customization, but it would be nice to keep it to the minimum. Did you talk to tidb-operator team? Don't they want to have a /ready across all components. I've herd there is a work on tidb-operator v2 or something like that. This might be a good opportunity to standartize protocol between control plane and data place as once it is out there, it is hard to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Tema, for TiDB Operator, it's better to use a general API (like /ready) to check the status in most cases. If this API can't meet some other special cases in the future, then we can consider to add/use some other APIs.

For this RegionLoaded case, in fact, TiDB Operator just wants to know when it can restart the next instance safely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@niubell can we land this one and open a separate task for a more general /ready endpoint?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@niubell can we land this one and open a separate task for a more general /ready endpoint?
@Tema Agreed, we will merge this PR first. After the general /ready proposal for all components is finalized, PD will also add the /ready API, cc @zhangjinpeng87 @csuzhangxc

@lhy1024
Copy link
Contributor Author

lhy1024 commented Dec 9, 2024

@rleungx @JmPotato @okJiang PTAL

@lhy1024
Copy link
Contributor Author

lhy1024 commented Dec 9, 2024

/hold wait for @zhangjinpeng87

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 9, 2024
@lhy1024 lhy1024 removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2024
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Dec 13, 2024
@@ -31,6 +31,7 @@ type Status struct {
Version string `json:"version"`
GitHash string `json:"git_hash"`
StartTimestamp int64 `json:"start_timestamp"`
RegionLoaded bool `json:"region_loaded"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the operator/tiup uses this API and PD doesn't provide this field? Will it block the upgrade process?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this part is implemented yet, but it should be backward compatible and pass readiness check if it is not provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion, we will use /ready api directly.

@lhy1024
Copy link
Contributor Author

lhy1024 commented Dec 18, 2024

PTAL @rleungx @okJiang

@lhy1024 lhy1024 requested review from okJiang and rleungx December 18, 2024 05:20
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 18, 2024
Copy link
Contributor

ti-chi-bot bot commented Dec 18, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-12-13 07:08:04.636906218 +0000 UTC m=+595074.725708756: ☑️ agreed by okJiang.
  • 2024-12-18 06:57:43.782319926 +0000 UTC m=+1026453.871122464: ☑️ agreed by rleungx.

@ti-chi-bot ti-chi-bot bot added the approved label Dec 18, 2024
@rleungx
Copy link
Member

rleungx commented Dec 18, 2024

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 18, 2024
@rleungx
Copy link
Member

rleungx commented Dec 18, 2024

Let's wait for @okJiang to give an approve.

Copy link
Contributor

ti-chi-bot bot commented Dec 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: okJiang, rleungx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rleungx
Copy link
Member

rleungx commented Dec 19, 2024

/hold cancel

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 19, 2024
@ti-chi-bot ti-chi-bot bot merged commit da4b43a into tikv:master Dec 19, 2024
25 checks passed
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label May 13, 2025
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #9319.
But this PR has conflicts, please resolve them!

ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request May 13, 2025
close tikv#8426, close tikv#8748

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this pull request May 14, 2025
close #8426, close #8748

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: artem_danilov <artem.danilov@airbnb.com>

Co-authored-by: lhy1024 <liuhanyang@pingcap.com>
Co-authored-by: lhy1024 <admin@liudos.us>
Co-authored-by: Artem Danilov <artem.danilov@airbnb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve availability during pd rolling restart Alleviate the latency impact when reloading PD
8 participants