@makasim makasim commented Jul 23, 2025

Describe Your Changes

Currently, all errors that occur during the handshake and dial phases (except for timeouts) are retried in execOnConnWithPossibleRetry. However, such errors typically result from network issues or CPU exhaustion on the storage side. In both cases, retrying is unlikely to succeed and may instead contribute to additional, unnecessary load on the system.

This PR disables retries for all errors encountered during the dial and handshake phases. The goal is to avoid redundant retry attempts in scenarios where they are unlikely to help and may worsen the underlying problem.
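
To illustrate the intended behavior, here is a minimal sketch in Go. It is not the actual netstorage code: `errConnAcquire`, `conn`, `isConnAcquireError` and `execWithPossibleRetry` are hypothetical names, and splitting the work into `getConn` and `f` callbacks is an assumption made only to keep the example short. The sketch returns immediately when a connection cannot be obtained (dial or handshake failure) and retries only failures that happen after a connection was obtained.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnAcquire is a hypothetical sentinel for failures that happen before
// any request is sent: pool checkout, dial or handshake.
var errConnAcquire = errors.New("cannot obtain connection from the pool")

// conn is a hypothetical stand-in for a pooled connection to a storage node.
type conn struct{}

// isConnAcquireError reports whether err originates from the dial/handshake phase.
func isConnAcquireError(err error) bool {
	return errors.Is(err, errConnAcquire)
}

// execWithPossibleRetry runs f on a connection returned by getConn.
// If getConn fails, the error is returned right away without a retry.
// If f fails, it is retried once on a newly obtained connection.
func execWithPossibleRetry(getConn func() (*conn, error), f func(*conn) error) error {
	const maxAttempts = 2
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		c, err := getConn()
		if err != nil {
			// Dial or handshake failed: retrying is unlikely to help
			// and would only add load, so give up immediately.
			return fmt.Errorf("%w: %v", errConnAcquire, err)
		}
		if lastErr = f(c); lastErr == nil {
			return nil
		}
	}
	return lastErr
}
```

With this shape, a caller can use `isConnAcquireError(err)` to tell a dial/handshake failure apart from a request failure, for example to mark the response as partial instead of retrying.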

Related to #9345

Checklist

The following checks are mandatory:

@makasim makasim changed the base branch from master to cluster July 23, 2025 07:56
@makasim makasim self-assigned this Jul 23, 2025
@makasim makasim changed the title lib/netstorage: do not retry "cannot obtain connection from the pool" errors app/vmselect/netstorage: do not retry "cannot obtain connection from the pool" errors Jul 23, 2025
@makasim makasim force-pushed the netstorage-do-not-retry-dial-handshake-errors branch 2 times, most recently from 85ef3b0 to 0b3f3d8 Compare July 23, 2025 09:15
@makasim makasim force-pushed the netstorage-do-not-retry-dial-handshake-errors branch from 0b3f3d8 to 3878cbc Compare July 23, 2025 09:16
@makasim makasim marked this pull request as ready for review July 23, 2025 09:25
@makasim makasim requested review from valyala, zekker6, f41gh7 and rtm0 July 23, 2025 09:25

makasim commented Jul 25, 2025

@f41gh7 @zekker6 please take a look

@f41gh7 f41gh7 left a comment

LGTM

@makasim makasim merged commit ff446f8 into cluster Jul 28, 2025
8 checks passed
@makasim makasim deleted the netstorage-do-not-retry-dial-handshake-errors branch July 28, 2025 12:29
makasim added a commit that referenced this pull request Jul 28, 2025
…the pool" errors (#9484)

Currently, all errors that occur during the handshake and dial phases
(except for timeouts) are retried in
[execOnConnWithPossibleRetry](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/0c4062b7276cecb3345197f6aa1181cfab1f2f00/app/vmselect/netstorage/netstorage.go#L2431).
However, such errors typically result from network issues or CPU
exhaustion on the storage side. In both cases, retrying is unlikely to
succeed and may instead contribute to additional, unnecessary load on
the system.

This PR disables retries for all errors encountered during the handshake
and dial process. The goal is to avoid redundant retry attempts in
scenarios where they are unlikely to help and may worsen the underlying
problem.

Related to #9345

The following checks are **mandatory**:

- [ ] My change adheres to [VictoriaMetrics contributing guidelines](https://docs.victoriametrics.com/victoriametrics/contributing/#pull-request-checklist).
- [ ] My change adheres to [VictoriaMetrics development goals](https://docs.victoriametrics.com/victoriametrics/goals/).