Skip to content

Conversation

Tema
Copy link
Contributor

@Tema Tema commented Apr 26, 2025

What problem does this PR solve?

Issue Number: ref #9188

What is changed and how does it work?

#9219 fixed error propagation, but had a bug that it propagates an error only for the first request on the stream. This PR fixes the bug and adds test coverage.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.

Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 26, 2025
Copy link
Contributor

ti-chi-bot bot commented Apr 26, 2025

Hi @Tema. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 26, 2025
@ti-chi-bot ti-chi-bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 26, 2025
@ti-chi-bot
Copy link
Member

Now you can start all CI jobs with /test all in comment or query the triggers with /test ?

@Tema
Copy link
Contributor Author

Tema commented Apr 26, 2025

@JmPotato @bufferflies I've found a bug while writing tests for #9219 PTAL

@Tema
Copy link
Contributor Author

Tema commented Apr 26, 2025

/retest

1 similar comment
@Tema
Copy link
Contributor Author

Tema commented Apr 26, 2025

/retest

Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
Copy link

codecov bot commented Apr 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.11%. Comparing base (2bc6a72) to head (5f539f8).
Report is 10 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9268      +/-   ##
==========================================
+ Coverage   75.97%   76.11%   +0.14%     
==========================================
  Files         470      472       +2     
  Lines       73051    73264     +213     
==========================================
+ Hits        55503    55768     +265     
+ Misses      14101    14050      -51     
+ Partials     3447     3446       -1     
Flag Coverage Δ
unittests 76.11% <100.00%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Tema added 2 commits April 28, 2025 15:25
Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
@Tema
Copy link
Contributor Author

Tema commented Apr 28, 2025

@JmPotato @bufferflies I've done with sysbench tests as well. Now it looks good with only subsecond impact on leader transfer. PTAL

-----
# 1 thread to 1 tidb
sysbench oltp_update_non_index --non_index_updates=1 --mysql-ignore-errors=1105 --skip_trx=true --tables=1 --table_size=16384  --threads=1 --mysql-host=${HOST} --mysql-user=root --mysql-db=test --mysql-port=4000 --report-interval=1 --time=300 run

[ 22s ] thds: 1 tps: 113.95 qps: 113.95 (r/w/o: 0.00/113.95/0.00) lat (ms,95%): 11.04 err/s: 0.00 reconn/s: 0.00
[ 23s ] thds: 1 tps: 113.02 qps: 113.02 (r/w/o: 0.00/113.02/0.00) lat (ms,95%): 11.24 err/s: 0.00 reconn/s: 0.00

[ 24s ] thds: 1 tps: 67.01 qps: 67.01 (r/w/o: 0.00/67.01/0.00) lat (ms,95%): 16.71 err/s: 0.00 reconn/s: 0.00

[ 25s ] thds: 1 tps: 116.02 qps: 116.02 (r/w/o: 0.00/116.02/0.00) lat (ms,95%): 11.45 err/s: 0.00 reconn/s: 0.00
[ 26s ] thds: 1 tps: 114.99 qps: 114.99 (r/w/o: 0.00/114.99/0.00) lat (ms,95%): 11.24 err/s: 0.00 reconn/s: 0.00

-----
# 64 threads across 10 tidb pods
sysbench oltp_update_non_index --non_index_updates=1 --mysql-ignore-errors=1105 --skip_trx=true --tables=1 --table_size=16384  --threads=64 --mysql-host=${HOST} --mysql-user=root --mysql-db=test --mysql-port=4000 --report-interval=1 --time=300 run

[ 8s ] thds: 64 tps: 2030.01 qps: 2030.01 (r/w/o: 0.00/2030.01/0.00) lat (ms,95%): 118.92 err/s: 0.00 reconn/s: 0.00
[ 9s ] thds: 64 tps: 1914.99 qps: 1914.99 (r/w/o: 0.00/1914.99/0.00) lat (ms,95%): 121.08 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 64 tps: 1924.00 qps: 1924.00 (r/w/o: 0.00/1924.00/0.00) lat (ms,95%): 123.28 err/s: 0.00 reconn/s: 0.00

[ 11s ] thds: 64 tps: 1390.00 qps: 1390.00 (r/w/o: 0.00/1390.00/0.00) lat (ms,95%): 158.63 err/s: 2.00 reconn/s: 0.00

[ 12s ] thds: 64 tps: 2030.01 qps: 2030.01 (r/w/o: 0.00/2030.01/0.00) lat (ms,95%): 116.80 err/s: 0.00 reconn/s: 0.00
[ 13s ] thds: 64 tps: 2060.02 qps: 2060.02 (r/w/o: 0.00/2060.02/0.00) lat (ms,95%): 114.72 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 64 tps: 2007.01 qps: 2007.01 (r/w/o: 0.00/2007.01/0.00) lat (ms,95%): 121.08 err/s: 0.00 reconn/s: 0.00

@@ -74,10 +75,10 @@ func (s *TSODispatcher) DispatchRequest(serverCtx context.Context, req Request,
key := req.getForwardedHost()
val, loaded := s.dispatchChs.Load(key)
if !loaded {
val = tsoRequestProxyQueue{requestCh: make(chan Request, maxMergeRequests+1)}
val = &tsoRequestProxyQueue{requestCh: make(chan Request, maxMergeRequests+1)}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the context was not actually stored in map due to this bug, so the context never was return from this method except the first request.

Copy link
Member

@JmPotato JmPotato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest LGTM.

s.assertReceiveError(re, "pd is not leader of cluster")

// verify fails faster than 3 second timeout, otherwise the unavailable time will be too long.
re.Less(time.Since(start), time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comment suggests, should this be changed to re.Less(time.Since(start), 3*time.Second)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is 3 seconds indeed, but the test verifies that it is below timeout. I don't really want to hardcode timeout for this test, it just verifies that it is "instant" failure, not based on timeout. Please let me know if change is still required.

s.assertReceiveError(re, "Canceled")

// verify fails faster than 3 second timeout, otherwise the unavailable time will be too long.
re.Less(time.Since(start), time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Contributor

@bufferflies bufferflies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest lgtm

s.serverCancel()

start := time.Now()
s.assertReceiveError(re, "Canceled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s.assertReceiveError(re, "Canceled")
s.assertReceiveError(re, "Cancelled")

Copy link
Contributor Author

@Tema Tema May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is go sdk error: "rpc error: code = Canceled desc = context canceled" where "Canceled" is grpc status. Single L is a preferred spelling in American English.

https://github.com/golang/go/blob/35b4fd9f373cbe13778eb259a19c496c9c613a1f/src/context/context.go#L167

s.verifyProxyIsHealthyWith(re, s.proxyClient)
}

func (s *tsoProxyTestSuite) verifyProxyIsHealthyWith(re *require.Assertions, client pdpb.PD_TsoClient) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about removing the first parameter and creating it in this function?

Signed-off-by: artem_danilov <artem.danilov@airbnb.com>
@Tema
Copy link
Contributor Author

Tema commented May 6, 2025

@JmPotato @bufferflies I've addressed/replied the comments

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels May 6, 2025
Copy link
Contributor

ti-chi-bot bot commented May 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bufferflies, JmPotato

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JmPotato,bufferflies]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 7, 2025
Copy link
Contributor

ti-chi-bot bot commented May 7, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-05-06 05:56:18.172586444 +0000 UTC m=+334625.355489697: ☑️ agreed by JmPotato.
  • 2025-05-07 02:17:04.170188408 +0000 UTC m=+407871.353091667: ☑️ agreed by bufferflies.

@ti-chi-bot ti-chi-bot bot merged commit ff346b5 into tikv:master May 7, 2025
21 checks passed
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request May 7, 2025
ref tikv#9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #9289.
But this PR has conflicts, please resolve them!

@bufferflies bufferflies removed the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label May 7, 2025
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Jul 7, 2025
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Jul 7, 2025
ref tikv#9188

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #9495.
But this PR has conflicts, please resolve them!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants