Skip to content

Conversation

Tema
Copy link
Contributor

@Tema Tema commented Apr 19, 2025

What problem does this PR solve?

Issue Number: Close #9188

What is changed and how does it work?

The change fixes the problems outlined in #9188 (comment)

  • Don't block client stream on receiving next request and allow to return error from proxy
  • Allow to return proxy error to all clients served by the proxy error
  • Don't cancel proxy stream after original client stream is closed

Tested using the same way as reported in the issue and observe subsecond impact only now:

[ 67s ] thds: 64 tps: 2007.63 qps: 2007.63 (r/w/o: 0.00/2007.63/0.00) lat (ms,95%): 116.80 err/s: 0.00 reconn/s: 0.00
[ 68s ] thds: 64 tps: 1962.21 qps: 1962.21 (r/w/o: 0.00/1962.21/0.00) lat (ms,95%): 121.08 err/s: 0.00 reconn/s: 0.00
[ 69s ] thds: 64 tps: 1984.16 qps: 1984.16 (r/w/o: 0.00/1984.16/0.00) lat (ms,95%): 116.80 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 64 tps: 504.00 qps: 504.00 (r/w/o: 0.00/504.00/0.00) lat (ms,95%): 646.19 err/s: 0.00 reconn/s: 0.00
[ 71s ] thds: 64 tps: 1607.99 qps: 1607.99 (r/w/o: 0.00/1607.99/0.00) lat (ms,95%): 125.52 err/s: 0.00 reconn/s: 0.00
[ 72s ] thds: 64 tps: 2010.98 qps: 2010.98 (r/w/o: 0.00/2010.98/0.00) lat (ms,95%): 116.80 err/s: 0.00 reconn/s: 0.00
[ 73s ] thds: 64 tps: 1965.68 qps: 1965.68 (r/w/o: 0.00/1965.68/0.00) lat (ms,95%): 123.28 err/s: 0.00 reconn/s: 0.00

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.

Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 19, 2025
Copy link
Contributor

ti-chi-bot bot commented Apr 19, 2025

Hi @Tema. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 19, 2025
@ti-chi-bot ti-chi-bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 19, 2025
@ti-chi-bot
Copy link
Member

Now you can start all CI jobs with /test all in comment or query the triggers with /test ?

@Tema
Copy link
Contributor Author

Tema commented Apr 21, 2025

@rleungx @bufferflies @niubell PTAL

@Tema
Copy link
Contributor Author

Tema commented Apr 21, 2025

/retest

1 similar comment
@Tema
Copy link
Contributor Author

Tema commented Apr 21, 2025

/retest

Copy link

codecov bot commented Apr 21, 2025

Codecov Report

Attention: Patch coverage is 70.96774% with 18 lines in your changes missing coverage. Please review.

Project coverage is 76.02%. Comparing base (16304b7) to head (316b0cf).
Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #9219   +/-   ##
=======================================
  Coverage   76.01%   76.02%           
=======================================
  Files         470      470           
  Lines       73034    73045   +11     
=======================================
+ Hits        55516    55531   +15     
+ Misses      14074    14069    -5     
- Partials     3444     3445    +1     
Flag Coverage Δ
unittests 76.02% <70.96%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@bufferflies bufferflies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, Maybe need to add intergrate test?

Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
@Tema
Copy link
Contributor Author

Tema commented Apr 22, 2025

/retest

@@ -36,12 +37,20 @@ const (
maxMergeRequests = 10000
// DefaultTSOProxyTimeout defines the default timeout value of TSP Proxying
DefaultTSOProxyTimeout = 3 * time.Second
// TSOProxyStreamIdleTimeout defines how long Proxy stream will live if no request is received
TSOProxyStreamIdleTimeout = 5 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This const doesn't need to be exported.

}
defer cancel()

requests := make([]Request, maxMergeRequests+1)
needUpdateServicePrimaryAddr := len(tsoPrimaryWatchers) > 0 && tsoPrimaryWatchers[0] != nil
for {
noProxyRequestsTimer := time.NewTimer(TSOProxyStreamIdleTimeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to create a timer inside the loop? Is it possible to reuse the same timer always?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let me pull it outside of the loop and just reset at this line instead

case <-dispatcherCtx.Done():
return
}
tsDeadlineCh <- dl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check dispatcherCtx.Done() anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dispatcherCtx can be cancelled in this function only or through the parent, which is grpc service which lifetime aligns with the processes lifetime. Previously it was parented to grpc stream, but this is exactly what this PR tries to avoid.Let me know if it still make sense to check for it, I just felt that this is no point in it anymore and cleaned up the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what you mean, but in all cases involving channel sending and other potential blocking points, we are used to pairing them with select to ensure proper exit. This change makes me a bit uneasy, but if logically there is no possibility of it happening, I am okay with this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It make sense. It is also worth to have it in case the passing context evolves and become possible to cancel as the code evolves. I've returned it in the other potential blocking points, but this particular line is non-blocking so I don't see any reason to have it. But please let me know if I miss something and you strongly believe we should have it on this particular line as well.

if !loaded {
log.Info("Start new tso proxy dispatcher", zap.String("forwardedHost", req.getForwardedHost()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PD's logs, we don't use capitalized initials; it's best to keep it consistent with other places. Please forgive my OCD 😄.

}
case <-dispatcherCtx.Done():
case <-noProxyRequestsTimer.C:
log.Info("Close tso proxy as it is idle for a while")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Comment on lines 514 to 515
var request *pdpb.TsoRequest
var err error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var request *pdpb.TsoRequest
var err error
var (
request *pdpb.TsoRequest
err error
)

forwarder = newTSOForwarder(stream)
tsoStreamErr error
// The following are tso forward stream related variables.
tsoRequestProxyQueue context.Context
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using a name to emphasize that it is a context rather than the queue itself? Like tsoRequestProxyQueueCtx.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, I forgot to rename it after changing struct to Context. Will do.

@Tema
Copy link
Contributor Author

Tema commented Apr 23, 2025

/retest

1 similar comment
@Tema
Copy link
Contributor Author

Tema commented Apr 23, 2025

/retest

Copy link
Member

@JmPotato JmPotato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reset LGTM.

case <-dispatcherCtx.Done():
return
}
tsDeadlineCh <- dl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what you mean, but in all cases involving channel sending and other potential blocking points, we are used to pairing them with select to ensure proper exit. This change makes me a bit uneasy, but if logically there is no possibility of it happening, I am okay with this change.

@bufferflies
Copy link
Contributor

BTW, Maybe need to add intergrate test?

@Tema, I have added one test for this case; you can ignore the test unit. link #9243

@ti-chi-bot ti-chi-bot bot added dco-signoff: no Indicates the PR's author has not signed dco. and removed dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 23, 2025
Tema added 2 commits April 23, 2025 14:22
Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
@Tema Tema force-pushed the tso-proxy-timeout branch from fabeae7 to 1f30701 Compare April 23, 2025 23:28
Copy link
Contributor

ti-chi-bot bot commented Apr 25, 2025

@bufferflies: You cannot manually add or delete the cherry pick branch category labels. It will be added automatically by bot when the PR is created.

In response to adding label named type/cherry-pick-for-release-8.5.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Apr 25, 2025
close tikv#9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #9259.
But this PR has conflicts, please resolve them!

@Tema
Copy link
Contributor Author

Tema commented Apr 25, 2025

@JmPotato @bufferflies I'm still working on comprehensive unit tests for tso proxy server and cut a separate PR soon.

@bufferflies
Copy link
Contributor

@JmPotato @bufferflies I'm still working on comprehensive unit tests for tso proxy server and cut a separate PR soon.

Great, thanks for your contributions for PD.

@bufferflies bufferflies added affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. and removed affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. labels Apr 25, 2025
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Apr 25, 2025
close tikv#9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #9262.
But this PR has conflicts, please resolve them!

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #9263.
But this PR has conflicts, please resolve them!

ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Apr 25, 2025
close tikv#9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Apr 25, 2025
close tikv#9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #9264.
But this PR has conflicts, please resolve them!

ti-chi-bot bot added a commit that referenced this pull request Apr 27, 2025
close #9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: 童剑 <1045931706@qq.com>

Co-authored-by: Artem Danilov <artem.danilov@airbnb.com>
Co-authored-by: 童剑 <1045931706@qq.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
JmPotato pushed a commit to JmPotato/pd that referenced this pull request Jul 23, 2025
close tikv#9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: JmPotato <github@ipotato.me>
ti-chi-bot bot pushed a commit that referenced this pull request Jul 23, 2025
close #9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: JmPotato <github@ipotato.me>

Co-authored-by: Artem Danilov <artem.danilov@airbnb.com>
Co-authored-by: JmPotato <github@ipotato.me>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tidb_enable_tso_follower_proxy causes 10s unavailability during graceful leader transfer
5 participants