
Conversation

@JeffCyr (Contributor) commented Dec 17, 2015

Issue #1538 Akka.Remote performance - ForkJoinDispatcher

This PR contains a refactoring of the DedicatedThreadPool. The main improvement is the use of a shared work queue for all workers. The work queue is implemented with a ConcurrentQueue and a custom semaphore that is highly optimized for the thread-pool scenario. (I borrowed UnfairSemaphore from the coreclr repo; it's the one used by the real CLR ThreadPool.)
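As a rough illustration of that design, here is a minimal sketch of the shared-work-queue idea. This is not the PR's actual code: SemaphoreSlim stands in for UnfairSemaphore, and the class and member names are illustrative.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Minimal sketch: every worker pulls from one shared ConcurrentQueue, and a
// semaphore wakes one worker per enqueued item. The real PR replaces
// SemaphoreSlim with the custom UnfairSemaphore to cut wake-up overhead.
internal sealed class SharedQueuePoolSketch
{
    private readonly ConcurrentQueue<Action> _queue = new ConcurrentQueue<Action>();
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);

    public SharedQueuePoolSketch(int workerCount)
    {
        for (int i = 0; i < workerCount; i++)
        {
            new Thread(WorkerLoop) { IsBackground = true }.Start();
        }
    }

    public void QueueUserWorkItem(Action work)
    {
        _queue.Enqueue(work);
        _semaphore.Release(); // signal that one item is available
    }

    private void WorkerLoop()
    {
        while (true)
        {
            _semaphore.Wait();
            Action work;
            if (_queue.TryDequeue(out work))
                work();
        }
    }
}
```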

The new DedicatedThreadPool has ~3.5x higher throughput and ~32x lower worst-case latency.

Before the refactoring:

Throughput test (Queue 10 000 000 items and wait completion)

Global ThreadPool : 00:00:01.2247981
HeliosThreadPool : 00:00:03.8819794

Latency test (Sequentially queue 300 000 items)

Global ThreadPool : 00:00:00.4042989
HeliosThreadPool : 00:00:07.2341554

After the refactoring:

Throughput test (Queue 10 000 000 items and wait completion)

Global ThreadPool : 00:00:01.2171287
HeliosThreadPool : 00:00:01.0755531

Latency test (Sequentially queue 300 000 items)

Global ThreadPool : 00:00:00.5609632
HeliosThreadPool : 00:00:00.2240666

I commented out the deadlock-detection code; the option is disabled by default, and enabling it was extremely dangerous. Depending on your feedback, we can leave it disabled or find a better way to mitigate the pool deadlock.

@Aaronontheweb (Member)

cc @akkadotnet/developers: please merge in some of our benchmarks for comparison purposes.

@Aaronontheweb (Member)

Thanks @JeffCyr ! Mind contributing these changes back to https://github.com/helios-io/DedicatedThreadPool? We ship this as a stand-alone NuGet package too ;)

@JeffCyr (Contributor, Author) commented Dec 17, 2015

@Aaronontheweb Sure! I'll do that after the reviews.


public void Add(Action work)
{
m_queue.Enqueue(work);
Review comment from @JeffCyr (Contributor, Author):

Needs to protect against the race with CompleteAdding() here.

@JeffCyr (Contributor, Author) commented Dec 17, 2015

I fixed the race condition with CompleteAdding. It's now safe to dispose the DedicatedThreadPool: all work items for which QueueUserWorkItem returned true will be executed, and all threads will exit when the queue is empty.

DedicatedThreadPool.Dispose only triggers the shutdown; to wait until all threads have exited, you can call DedicatedThreadPool.WaitForThreadsExit().
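A usage-level sketch of those semantics (hedged: the DedicatedThreadPoolSettings constructor shape is an assumption here; QueueUserWorkItem, Dispose, and WaitForThreadsExit are the methods named above):

```csharp
using System;

internal static class ShutdownSemanticsSketch
{
    // Usage sketch of the shutdown semantics described above. The
    // DedicatedThreadPoolSettings constructor is assumed to take a thread
    // count; the real signature may differ.
    public static void Main()
    {
        var pool = new DedicatedThreadPool(new DedicatedThreadPoolSettings(Environment.ProcessorCount));

        // true means this item is guaranteed to run, even if Dispose is
        // called immediately afterwards.
        bool accepted = pool.QueueUserWorkItem(() => Console.WriteLine("work item ran"));

        pool.Dispose();            // only triggers the shutdown
        pool.WaitForThreadsExit(); // blocks until every worker thread has exited

        // After shutdown the pool refuses new work instead of racing with
        // CompleteAdding, so this should return false.
        bool acceptedLate = pool.QueueUserWorkItem(() => { });

        Console.WriteLine($"{accepted} {acceptedLate}");
    }
}
```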

@rogeralsing (Contributor)

Trying to review but this is beyond my understanding of threading.
cc @akkadotnet/contributors

@JeffCyr (Contributor, Author) commented Dec 17, 2015

For reviewers,

UnfairSemaphore is an exact copy of the one used by the .NET ThreadPool:
https://github.com/dotnet/coreclr/blob/97433b9d153843492008652ff6b7c3bf4d9ff31c/src/vm/win32threadpool.h#L124

The _outstandingRequests handling is also borrowed from the .NET ThreadPool:
https://github.com/dotnet/coreclr/blob/bc146608854d1db9cdbcc0b08029a87754e12b49/src/mscorlib/src/System/Threading/ThreadPool.cs#L568

I'm confident these algorithms are solid.

The code in GetConsumingEnumerable and CompleteAdding is mine. To the best of my knowledge it's solid too, but please pay special attention to these methods since they are new.

I'll post a gist of the tests I did when I have time.

Feel free to ask for clarifications, I'll do my best to answer them.
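For anyone who doesn't want to read the linked ThreadPool source, here's a simplified sketch of the _outstandingRequests idea (modeled on coreclr's ThreadPoolWorkQueue pattern; class and member names are illustrative, not the PR's exact code). The goal is to never keep more pending wake-up requests than there are worker threads, so a burst of enqueues releases the semaphore at most worker-count times.

```csharp
using System;
using System.Threading;

// Simplified sketch of the outstanding-requests accounting referenced above.
internal sealed class OutstandingRequestsSketch
{
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);
    private readonly int _workerCount = Environment.ProcessorCount;
    private int _outstandingRequests;

    // Called after enqueuing a work item.
    public void EnsureWorkerRequested()
    {
        int count = Volatile.Read(ref _outstandingRequests);
        while (count < _workerCount)
        {
            int prev = Interlocked.CompareExchange(ref _outstandingRequests, count + 1, count);
            if (prev == count)
            {
                _semaphore.Release(); // actually wake one worker
                break;
            }
            count = prev; // lost the race; retry with the value we observed
        }
    }

    // Called by a worker when it wakes up, before dequeuing. If the worker
    // then finds an item, it calls EnsureWorkerRequested again so a peer is
    // woken for any remaining work.
    public void MarkWorkerRequestSatisfied()
    {
        int count = Volatile.Read(ref _outstandingRequests);
        while (count > 0)
        {
            int prev = Interlocked.CompareExchange(ref _outstandingRequests, count - 1, count);
            if (prev == count)
                break;
            count = prev;
        }
    }
}
```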


public void Dispose(bool isDisposing)
[StructLayout(LayoutKind.Sequential)]
Review comment from @JeffCyr (Contributor, Author):

This should not be LayoutKind.Sequential. It can cause a misalignment issue on x86 when we use 64-bit Interlocked operations.

@JeffCyr (Contributor, Author) added:

After further investigation, LayoutKind.Sequential does not play well when the fields of the class contain a struct with LayoutKind.Explicit; the fields are not aligned as expected.

I will find a robust way to guarantee the proper alignment of the UnfairSemaphore's fields when I have some free time.
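For readers following along, the shape under discussion looks roughly like this (illustrative names only, not the actual UnfairSemaphore declaration):

```csharp
using System.Runtime.InteropServices;

// An explicitly laid-out, padded 64-bit state word...
[StructLayout(LayoutKind.Explicit, Size = 64)]
internal struct PaddedStateSketch
{
    [FieldOffset(0)]
    public long Value;
}

// ...embedded as a field of a sequentially laid-out class. The observation
// above is that this combination does not give the inner 64-bit field the
// alignment one might expect.
[StructLayout(LayoutKind.Sequential)]
internal sealed class SemaphoreSketch
{
    private PaddedStateSketch _state; // 64-bit field inside an explicit-layout struct
    private int _waiterCount;         // hypothetical neighboring field
}
```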

@JeffCyr (Contributor, Author) added:

OK, I had to refresh my memory about memory alignment, and it has nothing to do with the LayoutKind. On x86, objects on the heap are 4-byte aligned, so depending on where the object is allocated, its fields might not be 8-byte aligned.

This would have no impact if the underlying architecture were real x86, but on an x64 architecture, Interlocked performance suffers when the memory is not 8-byte aligned.

There are a few options:

  • Leave it as-is and accept the variable perf on x86.
  • Use a different semaphore implementation on x86.
  • Force the field's address onto an 8-byte boundary, but that may be hackish.
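To make the alignment point concrete, here is a hedged diagnostic sketch (hypothetical types; requires compiling with /unsafe) that checks whether a 64-bit field happens to land on an 8-byte boundary. In a 32-bit process the answer can vary from allocation to allocation, and a compacting GC can even change it when it relocates the object.

```csharp
using System;

internal sealed class StateHolderSketch
{
    public long State; // stand-in for the semaphore's 64-bit state word
}

internal static class AlignmentProbe
{
    public static unsafe bool IsEightByteAligned(StateHolderSketch holder)
    {
        // 'fixed' pins the object so taking the field's address is safe.
        fixed (long* p = &holder.State)
        {
            return ((ulong)p % 8) == 0;
        }
    }

    public static void Main()
    {
        // In an x86 (32-bit) process, objects are only guaranteed 4-byte
        // alignment, so some of these allocations may report 'False'.
        for (int i = 0; i < 5; i++)
            Console.WriteLine(IsEightByteAligned(new StateHolderSketch()));
    }
}
```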

Review comment from a Member:

I wouldn't worry about our x86 homies; it's an increasingly rare architecture outside of small client hardware. But do leave a comment.

@JeffCyr (Contributor, Author) replied:

Yeah, real x86 hardware is obsolete, but x86 processes running on x64 hardware are still pretty common when you have to use legacy libs that only work in x86.

@JeffCyr (Contributor, Author) added:

Interesting answer by @mikedn about why the misalignment is not an issue with the CLR ThreadPool:
https://github.com/dotnet/coreclr/issues/2383#issuecomment-165957482

@mattnischan (Contributor)

BTW, if anyone is interested in why the magic padding works the way it does, there's a good article on why the Disruptor does it in Java: http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast_22.html?m=1
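A minimal C# sketch of the padding trick that article describes (not the PR's actual layout): give a hot 64-bit counter its own cache line so that unrelated writes to neighboring fields can't invalidate it through false sharing.

```csharp
using System.Runtime.InteropServices;
using System.Threading;

// 64 bytes of padding on each side keeps the counter alone on its cache line
// (assuming the common 64-byte cache-line size).
[StructLayout(LayoutKind.Explicit, Size = 192)]
internal struct PaddedCounterSketch
{
    [FieldOffset(64)]
    private long _value;

    public long Increment() => Interlocked.Increment(ref _value);

    public long Read() => Volatile.Read(ref _value);
}
```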

@rogeralsing (Contributor)

@JeffCyr if you have some time to spare, I would love to hear some suggestions on how to implement this properly:
https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka/Actor/Scheduler/DedicatedThreadScheduler.cs#L24

We don't really see any perf issues there AFAIK, but it is super hacky, and if we could make it execute scheduled work with better precision, that would be a nice bonus.

@JeffCyr (Contributor, Author) commented Dec 19, 2015

@mattnischan Great article; it might be a good idea to pad the _outstandingRequests counter as well.

I did some benchmarks with disruptor-net a couple of years ago and observed random perf issues on x86. I didn't understand it at the time, but it may have been the same issue with a misaligned InterlockedCompareExchange64.

https://github.com/disruptor-net/Disruptor-net/blob/8fc6cde3b6f806d05c16b931113cd82ba68180d6/Disruptor/Sequence.cs#L55

@JeffCyr (Contributor, Author) commented Dec 19, 2015

@rogeralsing Sure! I'll take a look.

@JeffCyr (Contributor, Author) commented Dec 20, 2015

@rogeralsing I created issue #1593 for the DedicatedThreadScheduler perf review.

@annymsMthd (Contributor)

Looks like a couple of builds failed. Any info on these, @Aaronontheweb? It doesn't look like the performance tests generate a summary page yet.

{
internal class DedicatedThreadPoolSupervisor : IDisposable
//internal class DedicatedThreadPoolSupervisor : IDisposable
Review comment from a Member:

Go ahead and just delete the code here if we're not going to use it.

@Aaronontheweb (Member)

@JeffCyr I sent you an email with an idea for how we can review this and get it merged.

@JeffCyr force-pushed the experimental-dedicatedthreadpool branch from b56002d to f3e49f0 on January 8, 2016 at 18:06.
@JeffCyr (Contributor, Author) commented Jan 8, 2016

@Aaronontheweb I rebased the branch, deleted the commented-out code, and added some comments. Ready to go!

@Aaronontheweb (Member)

Performance numbers from the latest build of dev:

Akka.Remote.Tests.Performance.ForkJoinDispatcherRemoteMessagingThroughputSpec+OneWay

Measures the throughput of Akka.Remote over a particular transport using one-way messaging
1/7/2016 7:51:41 PM

System Info

NBench=NBench, Version=0.1.5.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000,IsMono=False,MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings

RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data


Totals

| Metric | Units | Max | Average | Min | StdDev |
|---|---|---:|---:|---:|---:|
| TotalCollections [Gen0] | collections | 23.00 | 22.54 | 22.00 | 0.52 |
| TotalCollections [Gen1] | collections | 1.00 | 0.92 | 0.00 | 0.28 |
| TotalCollections [Gen2] | collections | 0.00 | 0.00 | 0.00 | 0.00 |
| [Counter] RemoteMessageReceived | operations | 10,000.00 | 10,000.00 | 10,000.00 | 0.00 |

Per-second Totals

| Metric | Units / s | Max / s | Average / s | Min / s | StdDev / s |
|---|---|---:|---:|---:|---:|
| TotalCollections [Gen0] | collections | 30.81 | 27.63 | 22.69 | 2.52 |
| TotalCollections [Gen1] | collections | 1.40 | 1.14 | 0.00 | 0.36 |
| TotalCollections [Gen2] | collections | 0.00 | 0.00 | 0.00 | 0.00 |
| [Counter] RemoteMessageReceived | operations | 14,003.68 | 12,269.85 | 9,865.98 | 1,197.30 |

Raw Data

(Removed GC raw data for brevity)

[Counter] RemoteMessageReceived

| Run # | operations | operations / s | ns / operations |
|---:|---:|---:|---:|
| 1 | 10,000.00 | 11,699.21 | 85,475.83 |
| 2 | 10,000.00 | 13,377.19 | 74,754.12 |
| 3 | 10,000.00 | 14,003.68 | 71,409.81 |
| 4 | 10,000.00 | 11,183.44 | 89,417.96 |
| 5 | 10,000.00 | 12,459.31 | 80,261.25 |
| 6 | 10,000.00 | 9,865.98 | 101,358.38 |
| 7 | 10,000.00 | 10,960.23 | 91,238.98 |
| 8 | 10,000.00 | 13,302.51 | 75,173.79 |
| 9 | 10,000.00 | 13,020.51 | 76,801.93 |
| 10 | 10,000.00 | 13,375.84 | 74,761.65 |
| 11 | 10,000.00 | 12,859.30 | 77,764.75 |
| 12 | 10,000.00 | 11,907.15 | 83,983.14 |
| 13 | 10,000.00 | 11,493.77 | 87,003.69 |

@Aaronontheweb (Member)

Performance numbers for THIS PR

Akka.Remote.Tests.Performance.ForkJoinDispatcherRemoteMessagingThroughputSpec+OneWay

Measures the throughput of Akka.Remote over a particular transport using one-way messaging
1/8/2016 6:33:54 PM

System Info

NBench=NBench, Version=0.1.5.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000,IsMono=False,MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings

RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data


Totals

| Metric | Units | Max | Average | Min | StdDev |
|---|---|---:|---:|---:|---:|
| TotalCollections [Gen0] | collections | 23.00 | 23.00 | 23.00 | 0.00 |
| TotalCollections [Gen1] | collections | 1.00 | 0.62 | 0.00 | 0.51 |
| TotalCollections [Gen2] | collections | 0.00 | 0.00 | 0.00 | 0.00 |
| [Counter] RemoteMessageReceived | operations | 10,000.00 | 10,000.00 | 10,000.00 | 0.00 |

Per-second Totals

| Metric | Units / s | Max / s | Average / s | Min / s | StdDev / s |
|---|---|---:|---:|---:|---:|
| TotalCollections [Gen0] | collections | 43.57 | 41.66 | 39.99 | 1.18 |
| TotalCollections [Gen1] | collections | 1.89 | 1.11 | 0.00 | 0.92 |
| TotalCollections [Gen2] | collections | 0.00 | 0.00 | 0.00 | 0.00 |
| [Counter] RemoteMessageReceived | operations | 18,941.62 | 18,113.09 | 17,388.91 | 511.60 |

Raw Data

(Removed GC raw data for brevity)

[Counter] RemoteMessageReceived

| Run # | operations | operations / s | ns / operations |
|---:|---:|---:|---:|
| 1 | 10,000.00 | 18,745.99 | 53,344.75 |
| 2 | 10,000.00 | 18,941.62 | 52,793.80 |
| 3 | 10,000.00 | 18,007.26 | 55,533.17 |
| 4 | 10,000.00 | 18,860.80 | 53,020.03 |
| 5 | 10,000.00 | 18,051.50 | 55,397.05 |
| 6 | 10,000.00 | 18,121.46 | 55,183.18 |
| 7 | 10,000.00 | 18,138.26 | 55,132.09 |
| 8 | 10,000.00 | 18,194.60 | 54,961.37 |
| 9 | 10,000.00 | 18,309.07 | 54,617.74 |
| 10 | 10,000.00 | 17,700.62 | 56,495.20 |
| 11 | 10,000.00 | 17,505.50 | 57,124.91 |
| 12 | 10,000.00 | 17,504.64 | 57,127.70 |
| 13 | 10,000.00 | 17,388.91 | 57,507.93 |

@Aaronontheweb (Member)

Avg throughput of dev: 12,269.85 msg / s
Avg throughput of this PR: 18,113.09 msg / s

Lol holy shit

@annymsMthd (Contributor)

So much win

@Aaronontheweb added a commit that referenced this pull request on Jan 12, 2016: "Increased performance of DedicatedThreadPool"
@Aaronontheweb merged commit 3dc28bb into akkadotnet:dev on Jan 12, 2016.