Skip to content

TaskCanceledException when reconnecting after disconnect #2612

@asleire

Description

@asleire

What version of gRPC and what language are you using?

Grpc.Net.Client 2.70.0, C#

What operating system (Linux, Windows,...) and version?

Debian (Docker image mcr.microsoft.com/dotnet/aspnet:8.0)

What runtime / compiler are you using (e.g. .NET Core SDK version dotnet --info)

Docker image mcr.microsoft.com/dotnet/sdk:8.0

What did you do?

If possible, provide a recipe for reproducing the error. Try being specific and include code snippets if helpful.

Application setup

Our GrpcClient is registered simply as such using Grpc.Net.ClientFactory version 2.66:

 services.AddGrpcClient<Client>(o =>
        {
            o.Address = new Uri(<url>);
        });

Issue description

Our application is having intermittent RPC errors with exception System.Threading.Tasks.TaskCanceledException: A task was canceled.. Investigating the logs it appears to happen when requests are picking a subchannel that has previously disconnected.

Full log example:

2025-03-17 17:01:24.684	Req X: Successfully picked subchannel id '"1-1"' with address "authorization-api.opa:8081". Transport status: ActiveStream
2025-03-17 17:01:26.777	 Subchannel id '"1-1"' state changed to Idle. Detail: '"Disconnected."'.
2025-03-17 17:01:26.777	 Channel state updated to Idle.
2025-03-17 17:01:26.777	 Channel picker updated.
2025-03-17 17:02:07.983	Req1: Picked subchannel id '"1-1"' doesn't have a current address.
2025-03-17 17:02:07.983	Req2: Subchannel id '"1-1"' state changed to Connecting. Detail: '"Connection requested."'.
2025-03-17 17:02:07.983	Req2: Channel state updated to Connecting.
2025-03-17 17:02:07.983	Req2: Channel picker updated.
2025-03-17 17:02:07.983	Req2: Picked subchannel id '"1-1"' doesn't have a current address.
2025-03-17 17:02:07.983	Req3: Picked queued.
2025-03-17 17:02:07.983	Req1: Picked queued.
2025-03-17 17:02:07.983	Req2: Picked queued.
2025-03-17 17:02:07.983	 Subchannel id '"1-1"' connecting to transport.
2025-03-17 17:02:07.984	 Subchannel id '"1-1"' canceling connect.
2025-03-17 17:02:07.984	 Subchannel id '"1-1"' queuing connect because a connect is already in progress.
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' error connecting to socket "authorization-api.opa:8081".
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' state changed to TransientFailure. Detail: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.985	 Channel state updated to TransientFailure.
2025-03-17 17:02:07.985	 Channel picker updated.
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' connect canceled.
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' connecting to transport.
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' state changed to Connecting. Detail: '"Connecting to socket."'.
2025-03-17 17:02:07.985	 Channel state updated to Connecting.
2025-03-17 17:02:07.985	 Channel picker updated.
2025-03-17 17:02:07.988	 Subchannel id '"1-1"' connected to socket "authorization-api.opa:8081".
2025-03-17 17:02:07.988	Req3: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988	Req2: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988	Req1: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988	 Subchannel id '"1-1"' state changed to Ready. Detail: '"Successfully connected to socket."'.
2025-03-17 17:02:07.988	 Channel state updated to Ready.
2025-03-17 17:02:07.988	 Channel picker updated.
2025-03-17 17:02:07.993	Req3: Call failed with gRPC error status. Status code: 'Unavailable', Message: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.993	Req2: Call failed with gRPC error status. Status code: 'Unavailable', Message: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.993	Req1: Call failed with gRPC error status. Status code: 'Unavailable', Message: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.995	Req1: SAS Authorization failed because there's been a RpcException.
2025-03-17 17:02:07.995	Req3: SAS Authorization failed because there's been a RpcException.
2025-03-17 17:02:07.995	Req2: SAS Authorization failed because there's been a RpcException.

I've marked logs logged from separate HTTP contexts as Req1 Req2 and Req3.

Having briefly looked at the source code corresponding to the logs, my understanding of what is happening is:

2025-03-17 17:01:24.684	Req X: Successfully picked subchannel id '"1-1"' with address "authorization-api.opa:8081". Transport status: ActiveStream
2025-03-17 17:01:26.777	 Subchannel id '"1-1"' state changed to Idle. Detail: '"Disconnected."'.
2025-03-17 17:01:26.777	 Channel state updated to Idle.

The subchannel was connected but disconnects for whatever reason. This is fine.

2025-03-17 17:02:07.983	Req1: Picked subchannel id '"1-1"' doesn't have a current address.
2025-03-17 17:02:07.983	Req2: Subchannel id '"1-1"' state changed to Connecting. Detail: '"Connection requested."'.
2025-03-17 17:02:07.983	Req2: Channel state updated to Connecting.
2025-03-17 17:02:07.983	Req2: Channel picker updated.
2025-03-17 17:02:07.983	Req2: Picked subchannel id '"1-1"' doesn't have a current address.
2025-03-17 17:02:07.983	Req3: Picked queued.
2025-03-17 17:02:07.983	Req1: Picked queued.
2025-03-17 17:02:07.983	Req2: Picked queued.
2025-03-17 17:02:07.983	 Subchannel id '"1-1"' connecting to transport.

One of the three requests requests a new connection on the subchannel. All 3 requests queue up for the picker.

2025-03-17 17:02:07.983	 Subchannel id '"1-1"' connecting to transport.
2025-03-17 17:02:07.984	 Subchannel id '"1-1"' canceling connect.
2025-03-17 17:02:07.984	 Subchannel id '"1-1"' queuing connect because a connect is already in progress.
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' error connecting to socket "authorization-api.opa:8081".
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' state changed to TransientFailure. Detail: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.985	 Channel state updated to TransientFailure.
2025-03-17 17:02:07.985	 Channel picker updated.
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' connect canceled.
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' connecting to transport.
2025-03-17 17:02:07.985	 Subchannel id '"1-1"' state changed to Connecting. Detail: '"Connecting to socket."'.
2025-03-17 17:02:07.985	 Channel state updated to Connecting.
2025-03-17 17:02:07.985	 Channel picker updated.
2025-03-17 17:02:07.988	 Subchannel id '"1-1"' connected to socket "authorization-api.opa:8081".
2025-03-17 17:02:07.988	Req3: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988	Req2: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988	Req1: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"

The second line shows a connection being cancelled, followed by Channel state updated to TransientFailure. and Channel picker updated.. I believe this is how the TaskCanceledException error is propagated to the 3 enqueued requests. But why is the connection cancelled? My best guess is this line shows the previous connection is always cancelled, even if it is in an idle/disconnected state, which is causing the queued requests to fail.

I didn't spend a lot of time studying the code, so apologies if I am misunderstanding. I believe it is not intentional for these requests to fail with TaskCanceledException. I know we can set the WaitForReady CallOptions flag to retry on connection failure, but I don't believe it should be necessary since the subchannel canceling an idle connection doesn't constitute a failure in my eyes.

An additional observation: Every instance of this error in our logs appeared in two or more errors happening at the same millisecond. There were no single errors like this. That smells like a race condition to me

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions