-
Notifications
You must be signed in to change notification settings - Fork 806
Description
What version of gRPC and what language are you using?
Grpc.Net.Client 2.70.0, C#
What operating system (Linux, Windows,...) and version?
Debian (Docker image mcr.microsoft.com/dotnet/aspnet:8.0)
What runtime / compiler are you using (e.g. .NET Core SDK version dotnet --info
)
Docker image mcr.microsoft.com/dotnet/sdk:8.0
What did you do?
If possible, provide a recipe for reproducing the error. Try being specific and include code snippets if helpful.
Application setup
Our GrpcClient is registered simply as such using Grpc.Net.ClientFactory
version 2.66
:
services.AddGrpcClient<Client>(o =>
{
o.Address = new Uri(<url>);
});
Issue description
Our application is having intermittent RPC errors with exception System.Threading.Tasks.TaskCanceledException: A task was canceled.
. Investigating the logs it appears to happen when requests are picking a subchannel that has previously disconnected.
Full log example:
2025-03-17 17:01:24.684 Req X: Successfully picked subchannel id '"1-1"' with address "authorization-api.opa:8081". Transport status: ActiveStream
2025-03-17 17:01:26.777 Subchannel id '"1-1"' state changed to Idle. Detail: '"Disconnected."'.
2025-03-17 17:01:26.777 Channel state updated to Idle.
2025-03-17 17:01:26.777 Channel picker updated.
2025-03-17 17:02:07.983 Req1: Picked subchannel id '"1-1"' doesn't have a current address.
2025-03-17 17:02:07.983 Req2: Subchannel id '"1-1"' state changed to Connecting. Detail: '"Connection requested."'.
2025-03-17 17:02:07.983 Req2: Channel state updated to Connecting.
2025-03-17 17:02:07.983 Req2: Channel picker updated.
2025-03-17 17:02:07.983 Req2: Picked subchannel id '"1-1"' doesn't have a current address.
2025-03-17 17:02:07.983 Req3: Picked queued.
2025-03-17 17:02:07.983 Req1: Picked queued.
2025-03-17 17:02:07.983 Req2: Picked queued.
2025-03-17 17:02:07.983 Subchannel id '"1-1"' connecting to transport.
2025-03-17 17:02:07.984 Subchannel id '"1-1"' canceling connect.
2025-03-17 17:02:07.984 Subchannel id '"1-1"' queuing connect because a connect is already in progress.
2025-03-17 17:02:07.985 Subchannel id '"1-1"' error connecting to socket "authorization-api.opa:8081".
2025-03-17 17:02:07.985 Subchannel id '"1-1"' state changed to TransientFailure. Detail: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.985 Channel state updated to TransientFailure.
2025-03-17 17:02:07.985 Channel picker updated.
2025-03-17 17:02:07.985 Subchannel id '"1-1"' connect canceled.
2025-03-17 17:02:07.985 Subchannel id '"1-1"' connecting to transport.
2025-03-17 17:02:07.985 Subchannel id '"1-1"' state changed to Connecting. Detail: '"Connecting to socket."'.
2025-03-17 17:02:07.985 Channel state updated to Connecting.
2025-03-17 17:02:07.985 Channel picker updated.
2025-03-17 17:02:07.988 Subchannel id '"1-1"' connected to socket "authorization-api.opa:8081".
2025-03-17 17:02:07.988 Req3: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988 Req2: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988 Req1: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988 Subchannel id '"1-1"' state changed to Ready. Detail: '"Successfully connected to socket."'.
2025-03-17 17:02:07.988 Channel state updated to Ready.
2025-03-17 17:02:07.988 Channel picker updated.
2025-03-17 17:02:07.993 Req3: Call failed with gRPC error status. Status code: 'Unavailable', Message: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.993 Req2: Call failed with gRPC error status. Status code: 'Unavailable', Message: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.993 Req1: Call failed with gRPC error status. Status code: 'Unavailable', Message: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.995 Req1: SAS Authorization failed because there's been a RpcException.
2025-03-17 17:02:07.995 Req3: SAS Authorization failed because there's been a RpcException.
2025-03-17 17:02:07.995 Req2: SAS Authorization failed because there's been a RpcException.
I've marked logs logged from separate HTTP contexts as Req1
Req2
and Req3
.
Having briefly looked at the source code corresponding to the logs, my understanding of what is happening is:
2025-03-17 17:01:24.684 Req X: Successfully picked subchannel id '"1-1"' with address "authorization-api.opa:8081". Transport status: ActiveStream
2025-03-17 17:01:26.777 Subchannel id '"1-1"' state changed to Idle. Detail: '"Disconnected."'.
2025-03-17 17:01:26.777 Channel state updated to Idle.
The subchannel was connected but disconnects for whatever reason. This is fine.
2025-03-17 17:02:07.983 Req1: Picked subchannel id '"1-1"' doesn't have a current address.
2025-03-17 17:02:07.983 Req2: Subchannel id '"1-1"' state changed to Connecting. Detail: '"Connection requested."'.
2025-03-17 17:02:07.983 Req2: Channel state updated to Connecting.
2025-03-17 17:02:07.983 Req2: Channel picker updated.
2025-03-17 17:02:07.983 Req2: Picked subchannel id '"1-1"' doesn't have a current address.
2025-03-17 17:02:07.983 Req3: Picked queued.
2025-03-17 17:02:07.983 Req1: Picked queued.
2025-03-17 17:02:07.983 Req2: Picked queued.
2025-03-17 17:02:07.983 Subchannel id '"1-1"' connecting to transport.
One of the three requests requests a new connection on the subchannel. All 3 requests queue up for the picker.
2025-03-17 17:02:07.983 Subchannel id '"1-1"' connecting to transport.
2025-03-17 17:02:07.984 Subchannel id '"1-1"' canceling connect.
2025-03-17 17:02:07.984 Subchannel id '"1-1"' queuing connect because a connect is already in progress.
2025-03-17 17:02:07.985 Subchannel id '"1-1"' error connecting to socket "authorization-api.opa:8081".
2025-03-17 17:02:07.985 Subchannel id '"1-1"' state changed to TransientFailure. Detail: '"Error connecting to subchannel."'.
2025-03-17 17:02:07.985 Channel state updated to TransientFailure.
2025-03-17 17:02:07.985 Channel picker updated.
2025-03-17 17:02:07.985 Subchannel id '"1-1"' connect canceled.
2025-03-17 17:02:07.985 Subchannel id '"1-1"' connecting to transport.
2025-03-17 17:02:07.985 Subchannel id '"1-1"' state changed to Connecting. Detail: '"Connecting to socket."'.
2025-03-17 17:02:07.985 Channel state updated to Connecting.
2025-03-17 17:02:07.985 Channel picker updated.
2025-03-17 17:02:07.988 Subchannel id '"1-1"' connected to socket "authorization-api.opa:8081".
2025-03-17 17:02:07.988 Req3: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988 Req2: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
2025-03-17 17:02:07.988 Req1: Picked failure with status: "Status(StatusCode=\"Unavailable\", Detail=\"Error connecting to subchannel.\", DebugException=\"System.Threading.Tasks.TaskCanceledException: A task was canceled.\")"
The second line shows a connection being cancelled, followed by Channel state updated to TransientFailure.
and Channel picker updated.
. I believe this is how the TaskCanceledException
error is propagated to the 3 enqueued requests. But why is the connection cancelled? My best guess is this line shows the previous connection is always cancelled, even if it is in an idle/disconnected state, which is causing the queued requests to fail.
I didn't spend a lot of time studying the code, so apologies if I am misunderstanding. I believe it is not intentional for these requests to fail with TaskCanceledException
. I know we can set the WaitForReady
CallOptions flag to retry on connection failure, but I don't believe it should be necessary since the subchannel canceling an idle connection doesn't constitute a failure in my eyes.
An additional observation: Every instance of this error in our logs appeared in two or more errors happening at the same millisecond. There were no single errors like this. That smells like a race condition to me