Skip to content

Conversation

0x53A
Copy link
Contributor

@0x53A 0x53A commented Mar 7, 2017

ref #2546

@0x53A
Copy link
Contributor Author

0x53A commented Mar 7, 2017

The fix for this WOULD be ease:

image

setting TaskCreationOptions.RunContinuationsAsynchronously solves the issue.

But that is not available in .Net 4.5...

For myself, I have solved it by reimplementing Ask with this parameter.

@Aaronontheweb
Copy link
Member

@0x53A apologies for not responding to this earlier, just been catching myself up on the issue now...

So I think I understand the issue here: the Task is getting scheduled onto the current context, which happens to be the dedicated thread pool that the Akka.Remote actors run on... Not on the global TPL. And this becomes an issue because the "continuation" the occurs after the await blocks on the Task.Result from the second Ask... Definitely an interesting edge case. This occurs, I think, because the InboundMessageDispatcher into Akka.Remote Tells the message directly to the IActorRef, but because FutureActorRef has no mailbox this code gets executed directly when the Tell operation is invoked, hence it's running on the DedicatedThreadPool.

Honestly, I agree with @Danthar's sentiment earlier that this is a TPL design issue, but none the less we should do something about it. Holding up Akka.Remote for however long the task runs isn't cool.

Couple of ideas for things we can do without bumping to 4.6 (that's a bigger change:)

  1. Get rid of await from ResolveOne since it's not adding any value there;
  2. Schedule the Tell operation for FutureActorRef to run on the Dispatcher instead of doing it inline... Although that's a bit of a hack that I don't like and it will decrease performance of Ask...

For now, we should log it and address it as "won't fix" (until we decide on bumping the .NET Framework version, which we have an opportunity to do coming up....) This can be remedied by not calling .Result on an Ask. But it's a nice catch.

@0x53A
Copy link
Contributor Author

0x53A commented Mar 8, 2017

Hi @Aaronontheweb , thank you for taking a look at this.

The FutureActorRef directly running on the DedicatedThreadPool is not the real problem, as it itself is fast.
But the TaskCompletionSource has a "performance optimisation", where it will try to run continuations directly on the Thread that calls SetResult. So anything that continues after TaskCompletionSource.Task will run on the Thread that set the Result, which happens to be the DedicatedThreadPool.

For now, we should log it and address it as "won't fix" (until we decide on bumping the .NET Framework version, which we have an opportunity to do coming up....)

For myself I have fixed it by reimplementing Ask in my own code, which targets net 4.6

This can be remedied by not calling .Result on an Ask. But it's a nice catch.

The case with two Asks deadlocking is just an extreme case, where the issue becomes easily visible - anything executed after an await will still block the DedicatedThreadPool.

So If I were to do something like

var msg = await here.Ask<string>("ping").ConfigureAwait(false);
// pretend this is heavy cpu computation
Thread.Sleep(1);

Then the ActorSystem would stop receiving messages for the duration of the CPU work.

Or if I did

 var msg = await here.Ask<string>("ping").ConfigureAwait(false);
MessageBox.Show(msg);

then it would block until I close the MessageBox.

This only isn't an issue, if you do a second await directly afterwards, so that it thread switches again.

I think there are two possible fixes while still compiling again 4.5:

  1. Do a runtime check, and pass the flag as a numeric literal if it is available.
  2. Simulate what the flag does internally anyway, and run TaskCompletionSource.SetResult on a ThreadPool Thread:
    image

I can think of no case, where the current behaviour is desireable, so I would go with option 2.

@0x53A 0x53A force-pushed the demonstrate-deadlock branch from 9648a19 to 3f91e7f Compare March 8, 2017 08:36
@0x53A
Copy link
Contributor Author

0x53A commented Mar 8, 2017

@Aaronontheweb the current failures are
image

image

Is the CI just slightly flaky, or did I break something?

@0x53A
Copy link
Contributor Author

0x53A commented Mar 9, 2017

green, and Akka.Tests.Performance.dll passed between [23:27:45] and [23:52:40] which is 24:55

@Aaronontheweb
Copy link
Member

As it's currently implemented, this is not a good change.

Performance numbers as measured by Akka.Tests.Performance.Actor.Pattern.AskSpec+AskThroughput

Before

      Metric |           Units |             Max |         Average |             Min |          StdDev |

---------------- |---------------- |---------------- |---------------- |---------------- |---------------- |
TotalBytesAllocated | bytes | 5,940,936.00 | 4,462,434.67 | 3,050,488.00 | 1,446,372.90 |
[Counter] AskReplies | operations | 145,212.00 | 145,212.00 | 145,212.00 | 0.00 |

Per-second Totals

      Metric |       Units / s |         Max / s |     Average / s |         Min / s |      StdDev / s |

---------------- |---------------- |---------------- |---------------- |---------------- |---------------- |
TotalBytesAllocated | bytes | 6,538,034.42 | 4,796,193.94 | 3,170,045.64 | 1,686,972.32 |
[Counter] AskReplies | operations | 159,806.65 | 155,108.01 | 150,903.29 | 4,472.18 |

After

Totals

      Metric |           Units |             Max |         Average |             Min |          StdDev |

---------------- |---------------- |---------------- |---------------- |---------------- |---------------- |
TotalBytesAllocated | bytes | 8,964,024.00 | 8,235,056.00 | 7,772,192.00 | 638,924.40 |
[Counter] AskReplies | operations | 132,782.00 | 132,782.00 | 132,782.00 | 0.00 |

Per-second Totals

      Metric |       Units / s |         Max / s |     Average / s |         Min / s |      StdDev / s |

---------------- |---------------- |---------------- |---------------- |---------------- |---------------- |
TotalBytesAllocated | bytes | 9,517,265.50 | 8,635,272.03 | 8,054,319.73 | 776,544.86 |
[Counter] AskReplies | operations | 142,384.01 | 139,188.49 | 134,204.43 | 4,373.27 |

It's the number of allocations that's the concern here. It's the Task.Run plus the anonymous method allocation that's responsible for that. You could probably use a static method for the latter and shave the number of allocations down significantly.

I'm reluctant to implement this change anyway because it's a hack for an edge case that hasn't been reported in 3 years. I'd rather fix it correctly by fixing or eliminating our usage of await within Akka.Remote.

@0x53A
Copy link
Contributor Author

0x53A commented Mar 10, 2017

The thing is, this is NOT related to await, this is related to TaskCompletionSource.SetResult.
TaskCompletionSource.SetResult needs to be called outside of the DedicatedThreadPool (or RunContinuationsAsynchronously needs to be set).

This probably didn't surface before because in most usage cases the await Ask is followed by another await, which will unblock the DedicatedThreadPool immediatly.

I will try how much performance using a static method would recoup, and I will also test it with RunContinuationsAsynchronously. I assume that RunContinuationsAsynchronously would also be worse than the baseline, because internally it would need to schedule the continuations on the threadpool anyway.

If you say you won't pull this because of the performance, then I can totally understand that and feel free to close it. I have reimplemented Ask for myself, so am not blocked by this. But even though it may be an edge case, if you hit it, it is a) hard to find out the root cause and b) impossible to work around externally without reimplementing Ask.

@Aaronontheweb
Copy link
Member

@0x53A there'd be a performance hit on any continuations that occur after the original Task completes with that change in 4.6, not on the completing the task itself (which is what this spec measures)

@0x53A
Copy link
Contributor Author

0x53A commented Mar 10, 2017

@Aaronontheweb I think most users will actually run on >= net 4.6, regardless of what Akka is compiled against. What would you think about checking at runtime if we can use RunContinuationsAsynchronously and then passing it as a literal?

This is not perfect, because then changing the runtime would change the execution in a subtle way.

Edit: I found a really good explanation of the issue here: http://stackoverflow.com/questions/22579206/how-can-i-prevent-synchronous-continuations-on-a-task

@0x53A
Copy link
Contributor Author

0x53A commented Mar 11, 2017

This is the result of DEV vs RunContinuationsAsynchronously:


BASELINE

Akka.Tests.Performance.Actor.Pattern.AskSpec+AskThroughput

Tests how quickly ICanTell.Ask operations can be performed, and with how much memory
3/11/2017 2:53:50 PM

System Info

NBench=NBench, Version=0.3.4.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=8
CLR=4.0.30319.42000,IsMono=False,MaxGcGeneration=2
WorkerThreads=32767, IOThreads=8

NBench Settings

RunMode=Throughput, TestMode=Measurement
NumberOfIterations=3, MaximumRunTime=00:00:05
Concurrent=True
Tracing=True

Data


Totals

Metric Units Max Average Min StdDev
TotalBytesAllocated bytes 8,355,592.00 5,783,848.00 3,355,160.00 2,503,283.61
[Counter] AskReplies operations 181,715.00 181,715.00 181,715.00 0.00

Per-second Totals

Metric Units / s Max / s Average / s Min / s StdDev / s
TotalBytesAllocated bytes 8,475,729.31 5,886,483.53 3,426,217.41 2,527,225.63
[Counter] AskReplies operations 185,563.46 185,121.99 184,327.71 689.29

Raw Data

TotalBytesAllocated

Run # bytes bytes / s ns / bytes
1 5,640,792.00 5,757,503.87 173.69
2 8,355,592.00 8,475,729.31 117.98
3 3,355,160.00 3,426,217.41 291.87

[Counter] AskReplies

Run # operations operations / s ns / operations
1 181,715.00 185,474.81 5,391.57
2 181,715.00 184,327.71 5,425.12
3 181,715.00 185,563.46 5,388.99

with RunContinuationsAsynchronously

Akka.Tests.Performance.Actor.Pattern.AskSpec+AskThroughput

Tests how quickly ICanTell.Ask operations can be performed, and with how much memory
3/11/2017 2:56:44 PM

System Info

NBench=NBench, Version=0.3.4.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=8
CLR=4.0.30319.42000,IsMono=False,MaxGcGeneration=2
WorkerThreads=32767, IOThreads=8

NBench Settings

RunMode=Throughput, TestMode=Measurement
NumberOfIterations=3, MaximumRunTime=00:00:05
Concurrent=True
Tracing=True

Data


Totals

Metric Units Max Average Min StdDev
TotalBytesAllocated bytes 7,225,336.00 6,584,005.33 6,017,088.00 607,551.49
[Counter] AskReplies operations 155,559.00 155,559.00 155,559.00 0.00

Per-second Totals

Metric Units / s Max / s Average / s Min / s StdDev / s
TotalBytesAllocated bytes 7,261,990.46 6,658,732.76 6,091,254.13 586,187.68
[Counter] AskReplies operations 158,267.99 157,364.19 156,348.16 964.82

Raw Data

TotalBytesAllocated

Run # bytes bytes / s ns / bytes
1 7,225,336.00 7,261,990.46 137.70
2 6,017,088.00 6,091,254.13 164.17
3 6,509,592.00 6,622,953.69 150.99

[Counter] AskReplies

Run # operations operations / s ns / operations
1 155,559.00 156,348.16 6,395.98
2 155,559.00 157,476.41 6,350.16
3 155,559.00 158,267.99 6,318.40

Even this seems to have a pretty drastic perfomance impact:

x Base Change
Ops/s 185k 158k
Bytes/op 31.8 42.3

So if that is the best the BCL team managed, then I don't think I will be able to find a way to solve this issue without performance impacts.

@Aaronontheweb
Copy link
Member

@0x53A reopened this PR and resolved the conflicts. We'll let CI run and if this works I'll merge it to resolve #3119

@0x53A
Copy link
Contributor Author

0x53A commented Sep 22, 2017

I would propose to take advantage of RunContinuationsAsynchronously if possible, and only fall back to Task.Run if not.

This would still have a small performance impact, but especially the bytes/op seem to be a bit better.

Edit: I added my proposed changes here: #3120

Would be interesting to see if there are any performance differences (and how large).

@0x53A 0x53A mentioned this pull request Sep 22, 2017
@0x53A
Copy link
Contributor Author

0x53A commented Sep 22, 2017

Pulling the data from the CI server:

(TR=Task.Run, RCA=RunContinuationsAsynchronously )

x TR#1 TR#2 TR#3 RCA#1 RCA#2 RCA#3
Ops/s 84.1k 83.3k 84.8k 97.8k 96.5k 68.9k
Bytes/op 74.9 65.9 84.4 60.7 39.6 59.7

@Danthar
Copy link
Member

Danthar commented Sep 23, 2017

Just noticed #3120. Closing this one

@Aaronontheweb
Copy link
Member

@0x53A can correct me, but I believe the intention was to merge both this and #3120 @Danthar - right?

@0x53A
Copy link
Contributor Author

0x53A commented Sep 26, 2017

This one was superseded by the other one.
The original pr always used task.run.

With the other one it uses RCA if available (if the user is running on net46) and otherwise falls back to task.run.

This should (slightly) improve performance, especially allocations.

@Aaronontheweb
Copy link
Member

Got it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants