Increased performance of DedicatedThreadPool #1569
Conversation
cc @akkadotnet/developers please merge in some of our benchmarks for comparison purposes.
Thanks @JeffCyr! Mind contributing these changes back to https://github.com/helios-io/DedicatedThreadPool? We ship this as a stand-alone NuGet package too ;)
@Aaronontheweb Sure! I'll do that after the reviews.
public void Add(Action work)
{
    m_queue.Enqueue(work);
}
Needs to protect against the race with CompleteAdding() here.
I fixed the race condition with CompleteAdding. It's now safe to dispose the DedicatedThreadPool: all work items that returned true when QueueUserWorkItem was called will be executed, and all the threads will exit when the queue is empty. DedicatedThreadPool.Dispose only triggers the shutdown; to wait until all threads have exited, you can call DedicatedThreadPool.WaitForThreadsExit().
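To make the new shutdown contract concrete, here is a minimal usage sketch. The member names (QueueUserWorkItem, Dispose, WaitForThreadsExit) come from the comment above; the settings constructor shape is assumed for illustration only.

    // Minimal sketch of the shutdown sequence described above (constructor shape assumed).
    var pool = new DedicatedThreadPool(new DedicatedThreadPoolSettings(4));

    // Returns true only if the item was enqueued before CompleteAdding took effect;
    // every item that returned true is guaranteed to run.
    bool accepted = pool.QueueUserWorkItem(() => Console.WriteLine("work"));

    pool.Dispose();             // only triggers the shutdown
    pool.WaitForThreadsExit();  // blocks until the queue is drained and all workers have exited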
Trying to review but this is beyond my understanding of threading.
For reviewers: UnfairSemaphore is an exact copy of the one used by the .NET ThreadPool. The _outstandingRequests handling is also borrowed from the .NET ThreadPool. I'm confident these algorithms are solid. The code in GetConsumingEnumerable and CompleteAdding is from me; to the best of my knowledge it's solid too, but you may want to pay special attention to these methods since they are new. I'll post a gist of the tests I did when I have time. Feel free to ask for clarifications; I'll do my best to answer them.
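For readers who want the shape of those two new methods without opening the diff, here is a rough sketch of the general pattern only, not the actual implementation: a ConcurrentQueue gated by a semaphore plus a completion flag, with SemaphoreSlim standing in for UnfairSemaphore and a polling wait standing in for the real wake-up logic.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;

    internal sealed class WorkQueueSketch
    {
        private readonly ConcurrentQueue<Action> _queue = new ConcurrentQueue<Action>();
        private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);
        private volatile bool _addingCompleted;

        public void Add(Action work)
        {
            _queue.Enqueue(work);
            _semaphore.Release();       // wake at most one parked consumer
        }

        public void CompleteAdding()
        {
            _addingCompleted = true;    // consumers drain the queue, then exit
        }

        public IEnumerable<Action> GetConsumingEnumerable()
        {
            while (true)
            {
                Action work;
                if (_queue.TryDequeue(out work))
                {
                    yield return work;
                }
                else if (_addingCompleted)
                {
                    yield break;        // queue is empty and no more items are coming
                }
                else
                {
                    // The real code wakes waiters explicitly; polling keeps the sketch simple.
                    _semaphore.Wait(TimeSpan.FromMilliseconds(100));
                }
            }
        }
    }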
public void Dispose(bool isDisposing)
[StructLayout(LayoutKind.Sequential)]
This should not be LayoutKind.Sequential; it can cause a misalignment issue on x86 when we use 64-bit Interlocked operations.
After further investigation, LayoutKind.Sequential does not play well when the fields of the class contain a struct with LayoutKind.Explicit; the fields are not aligned as expected.
I will find a robust way to guarantee the proper alignment of UnfairSemaphore's fields when I have some free time.
OK, I had to refresh my memory about memory alignment, and it has nothing to do with the LayoutKind. On x86, objects on the heap are 4-byte aligned, so depending on where an object is allocated, its fields might not be 8-byte aligned.
This would have no impact if the underlying architecture were real x86, but on an x64 architecture, Interlocked performance suffers when the memory is not 8-byte aligned (a small diagnostic sketch below shows one way to observe this).
There are a few options:
- Leave it like that and accept the random perf hit on x86.
- Use a different semaphore implementation on x86.
- Enforce an 8-byte boundary for the field's address, but that may be hackish.
I wouldn't worry about our x86 homies - it's an increasingly rare architecture outside of small client hardware. But do leave a comment.
Yeah, real x86 hardware is obsolete, but x86 processes running on x64 hardware are still pretty common when you have to use legacy libs that only work in x86.
Interesting answer by @mikedn about why the misalignment is not an issue with the CLR ThreadPool:
https://github.com/dotnet/coreclr/issues/2383#issuecomment-165957482
BTW, if anyone is interested in why the magic padding works the way it does, there's a good article on why Disruptor does it in Java: http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast_22.html?m=1
@JeffCyr if you have some time to spare, I would love to hear some suggestions on how to implement this properly. We don't really see any perf issues there AFAIK, but it is super hacky, and if we could make it execute scheduled work with better precision, that would be a nice bonus.
@mattnischan Great article, it might be a good idea to pad the _outstandingRequests counter as well. I did some benchmarks with disruptor-net a couple of years ago and observed random perf issues on x86. I didn't understand it at the time, but it may have been the same issue with misaligned InterlockedCompareExchange64.
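A rough sketch of the padding idea from that article, applied to a hot counter like _outstandingRequests (the layout below is illustrative and not the actual UnfairSemaphore layout): reserve a cache line on each side of the value so that writes to neighboring data can't false-share with it.

    using System.Runtime.InteropServices;
    using System.Threading;

    // 64 bytes of padding before the value; the explicit Size reserves space
    // after it as well, so nothing else can share the value's cache line.
    [StructLayout(LayoutKind.Explicit, Size = 192)]
    internal struct PaddedCounter
    {
        [FieldOffset(64)]
        private long _value;    // the only field anyone actually reads or writes

        public long Increment()
        {
            return Interlocked.Increment(ref _value);
        }

        public long Read()
        {
            return Volatile.Read(ref _value);
        }
    }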
@rogeralsing Sure! I'll take a look.
@rogeralsing I created issue #1593 for the DedicatedThreadScheduler perf review.
Looks like a couple of builds failed. Any info on these, @Aaronontheweb? It doesn't look like the performance tests generate a summary page yet.
{
internal class DedicatedThreadPoolSupervisor : IDisposable
//internal class DedicatedThreadPoolSupervisor : IDisposable
go ahead and just delete the code here if we're not going to use it.
@JeffCyr sent you an email with an idea on how we can review this and get it merged.
Force-pushed from b56002d to f3e49f0
@Aaronontheweb I rebased the branch, deleted the commented-out code, and added some comments. Ready to go!
Performance numbers from latest build

Akka.Remote.Tests.Performance.ForkJoinDispatcherRemoteMessagingThroughputSpec+OneWay
Measures the throughput of Akka.Remote over a particular transport using one-way messaging

System Info
NBench=NBench, Version=0.1.5.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000, IsMono=False, MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings
RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data
Totals
Per-second Totals
Raw Data
(Removed GC raw data for brevity)
[Counter] RemoteMessageReceived
Performance numbers for THIS PR

Akka.Remote.Tests.Performance.ForkJoinDispatcherRemoteMessagingThroughputSpec+OneWay
Measures the throughput of Akka.Remote over a particular transport using one-way messaging

System Info
NBench=NBench, Version=0.1.5.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000, IsMono=False, MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings
RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data
Totals
Per-second Totals
Raw Data
(Removed GC raw data for brevity)
[Counter] RemoteMessageReceived
Avg throughput of dev: 12,269.85 msg/s
Lol holy shit
So much win
Increased performance of DedicatedThreadPool
Issue #1538 Akka.Remote performance - ForkJoinDispatcher
This PR contains a refactoring of the DedicatedThreadPool. The main improvement is a shared work queue for all workers, implemented with a ConcurrentQueue and a custom semaphore highly optimized for the thread-pool scenario. (I borrowed UnfairSemaphore from the coreclr repo; it's the one used by the real CLR ThreadPool.)

The new DedicatedThreadPool has ~3.5x more throughput and 32x lower latency (in the worst case).
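For context, the overall shape of that design is roughly the following. This is a simplified sketch, with SemaphoreSlim standing in for UnfairSemaphore; the real implementation differs in the semaphore, shutdown handling, and error handling.

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    internal sealed class SharedQueuePoolSketch
    {
        private readonly ConcurrentQueue<Action> _queue = new ConcurrentQueue<Action>();
        private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);

        public SharedQueuePoolSketch(int numThreads)
        {
            for (var i = 0; i < numThreads; i++)
            {
                new Thread(WorkerLoop) { IsBackground = true }.Start();
            }
        }

        public void QueueUserWorkItem(Action work)
        {
            _queue.Enqueue(work);       // single queue shared by every worker
            _semaphore.Release();       // one permit per item wakes at most one idle worker
        }

        private void WorkerLoop()
        {
            while (true)
            {
                _semaphore.Wait();      // park until an item is available
                Action work;
                if (_queue.TryDequeue(out work))
                {
                    work();
                }
            }
        }
    }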
Before the refactoring:
Throughput test (Queue 10 000 000 items and wait completion)
Global ThreadPool : 00:00:01.2247981
HeliosThreadPool : 00:00:03.8819794
Latency test (Sequentially queue 300 000 items)
Global ThreadPool : 00:00:00.4042989
HeliosThreadPool : 00:00:07.2341554
After the refactoring:
Throughput test (Queue 10 000 000 items and wait completion)
Global ThreadPool : 00:00:01.2171287
HeliosThreadPool : 00:00:01.0755531
Latency test (Sequentially queue 300 000 items)
Global ThreadPool : 00:00:00.5609632
HeliosThreadPool : 00:00:00.2240666
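For reference, a throughput measurement of the kind listed above ("queue N items and wait completion") can be taken roughly like this; the pool's QueueUserWorkItem signature is assumed from earlier in the thread, and the harness is illustrative rather than the one used for these numbers.

    using System;
    using System.Diagnostics;
    using System.Threading;

    internal static class ThroughputBenchSketch
    {
        public static TimeSpan Measure(DedicatedThreadPool pool, int itemCount)
        {
            using (var done = new CountdownEvent(itemCount))
            {
                var sw = Stopwatch.StartNew();
                for (var i = 0; i < itemCount; i++)
                {
                    pool.QueueUserWorkItem(() => done.Signal());
                }
                done.Wait();    // "wait completion" = every queued item has signaled
                sw.Stop();
                return sw.Elapsed;
            }
        }
    }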
I commented out the deadlock detection code; the option is disabled by default, and enabling it was extremely dangerous. Depending on your feedback, we can leave it disabled or find a better way to mitigate pool deadlocks.