Increased performance of DedicatedThreadPool #1569
Conversation
cc @akkadotnet/developers please merge in some of our benchmarks for comparison purposes.
Thanks @JeffCyr! Mind contributing these changes back to https://github.com/helios-io/DedicatedThreadPool? We ship this as a stand-alone NuGet package too ;)
@Aaronontheweb Sure! I'll do that after the reviews.
public void Add(Action work)
{
    m_queue.Enqueue(work);
}
Needs to protect against the race with CompleteAdding() here.
I fixed the race condition with CompleteAdding. It's now safe to dispose the DedicatedThreadPool: all work items that returned true when QueueUserWorkItem was called will be executed, and all the threads will exit when the queue is empty. DedicatedThreadPool.Dispose only triggers the shutdown; to wait until all threads have exited, you can call DedicatedThreadPool.WaitForThreadsExit().
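To make the new shutdown contract concrete, here is a minimal usage sketch. The member names (QueueUserWorkItem, Dispose, WaitForThreadsExit) come from the comment above; the settings constructor shape is assumed for illustration only.

    // Minimal sketch of the shutdown sequence described above (constructor shape assumed).
    var pool = new DedicatedThreadPool(new DedicatedThreadPoolSettings(4));

    // Returns true only if the item was enqueued before CompleteAdding took effect;
    // every item that returned true is guaranteed to run.
    bool accepted = pool.QueueUserWorkItem(() => Console.WriteLine("work"));

    pool.Dispose();             // only triggers the shutdown
    pool.WaitForThreadsExit();  // blocks until the queue is drained and all workers have exited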
Trying to review but this is beyond my understanding of threading.
For reviewers: UnfairSemaphore is an exact copy of the one used by the .NET ThreadPool. The _outstandingRequests handling is also borrowed from the .NET ThreadPool. I'm confident these algorithms are solid. The code in GetConsumingEnumerable and CompleteAdding is from me; to the best of my knowledge it's solid too, but you may want to pay special attention to these methods since they are new. I'll post a gist of the tests I did when I have time. Feel free to ask for clarifications; I'll do my best to answer them.
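For readers who want the shape of those two new methods without opening the diff, here is a rough sketch of the general pattern only, not the actual implementation: a ConcurrentQueue gated by a semaphore plus a completion flag, with SemaphoreSlim standing in for UnfairSemaphore and a polling wait standing in for the real wake-up logic.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;

    internal sealed class WorkQueueSketch
    {
        private readonly ConcurrentQueue<Action> _queue = new ConcurrentQueue<Action>();
        private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);
        private volatile bool _addingCompleted;

        public void Add(Action work)
        {
            _queue.Enqueue(work);
            _semaphore.Release();       // wake at most one parked consumer
        }

        public void CompleteAdding()
        {
            _addingCompleted = true;    // consumers drain the queue, then exit
        }

        public IEnumerable<Action> GetConsumingEnumerable()
        {
            while (true)
            {
                Action work;
                if (_queue.TryDequeue(out work))
                {
                    yield return work;
                }
                else if (_addingCompleted)
                {
                    yield break;        // queue is empty and no more items are coming
                }
                else
                {
                    // The real code wakes waiters explicitly; polling keeps the sketch simple.
                    _semaphore.Wait(TimeSpan.FromMilliseconds(100));
                }
            }
        }
    }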
public void Dispose(bool isDisposing)
[StructLayout(LayoutKind.Sequential)]
This should not be LayoutKind.Sequential; it can cause a misalignment issue on x86 when we use 64-bit Interlocked operations.
After further investigation, LayoutKind.Sequential does not play well when the fields of the class contain a struct with LayoutKind.Explicit; the fields are not aligned as expected.
I will find a robust way to guarantee the proper alignment of UnfairSemaphore's fields when I have some free time.
OK, I had to refresh my memory about memory alignment, and it has nothing to do with the LayoutKind. On x86, objects on the heap are 4-byte aligned, so depending on where an object is allocated, its fields might not be 8-byte aligned.
This would have no impact if the underlying architecture were real x86, but on an x64 architecture, Interlocked performance suffers when the memory is not 8-byte aligned (a small diagnostic sketch below shows one way to observe this).
There are a few options:
- Leave it like that and accept the random perf hit on x86.
- Use a different semaphore implementation on x86.
- Enforce an 8-byte boundary for the field's address, but that may be hackish.
I wouldn't worry about our x86 homies - it's an increasingly rare architecture outside of small client hardware. But do leave a comment.
Yeah, real x86 hardware is obsolete, but x86 processes running on x64 hardware are still pretty common when you have to use legacy libs that only work in x86.
Interesting answer by @mikedn about why the misalignment is not an issue with the CLR ThreadPool:
https://github.com/dotnet/coreclr/issues/2383#issuecomment-165957482
BTW, if anyone is interested in why the magic padding works the way it does, there's a good article on why Disruptor does it in Java: http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast_22.html?m=1
@JeffCyr if you have some time to spare, I would love to hear some suggestions on how to implement this properly. We don't really see any perf issues there AFAIK, but it is super hacky, and if we could make it execute scheduled work with better precision, that would be a nice bonus.
@mattnischan Great article, it might be a good idea to pad the _outstandingRequests counter as well. I did some benchmarks with disruptor-net a couple of years ago and observed random perf issues on x86. I didn't understand it at the time, but it may have been the same issue with misaligned InterlockedCompareExchange64.
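A rough sketch of the padding idea from that article, applied to a hot counter like _outstandingRequests (the layout below is illustrative and not the actual UnfairSemaphore layout): reserve a cache line on each side of the value so that writes to neighboring data can't false-share with it.

    using System.Runtime.InteropServices;
    using System.Threading;

    // 64 bytes of padding before the value; the explicit Size reserves space
    // after it as well, so nothing else can share the value's cache line.
    [StructLayout(LayoutKind.Explicit, Size = 192)]
    internal struct PaddedCounter
    {
        [FieldOffset(64)]
        private long _value;    // the only field anyone actually reads or writes

        public long Increment()
        {
            return Interlocked.Increment(ref _value);
        }

        public long Read()
        {
            return Volatile.Read(ref _value);
        }
    }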
@rogeralsing Sure! I'll take a look.
@rogeralsing I created issue #1593 for the DedicatedThreadScheduler perf review.
Looks like a couple of builds failed. Any info on these, @Aaronontheweb? It doesn't look like the performance tests generate a summary page yet.
{
internal class DedicatedThreadPoolSupervisor : IDisposable
//internal class DedicatedThreadPoolSupervisor : IDisposable
go ahead and just delete the code here if we're not going to use it.
@JeffCyr sent you an email with an idea on how we can review this and get it merged.
Force-pushed from b56002d to f3e49f0
@Aaronontheweb I rebased the branch, deleted the commented-out code, and added some comments. Ready to go!
Performance numbers from latest build

Akka.Remote.Tests.Performance.ForkJoinDispatcherRemoteMessagingThroughputSpec+OneWay
Measures the throughput of Akka.Remote over a particular transport using one-way messaging

System Info
NBench=NBench, Version=0.1.5.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000, IsMono=False, MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings
RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data
Totals
Per-second Totals
Raw Data
(Removed GC raw data for brevity)
[Counter] RemoteMessageReceived
Performance numbers for THIS PR

Akka.Remote.Tests.Performance.ForkJoinDispatcherRemoteMessagingThroughputSpec+OneWay
Measures the throughput of Akka.Remote over a particular transport using one-way messaging

System Info
NBench=NBench, Version=0.1.5.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000, IsMono=False, MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings
RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data
Totals
Per-second Totals
Raw Data
(Removed GC raw data for brevity)
[Counter] RemoteMessageReceived
Avg throughput of dev: 12,269.85 msg/s
Lol holy shit
So much win
Increased performance of DedicatedThreadPool
Issue #1538 Akka.Remote performance - ForkJoinDispatcher
This PR contains a refactoring of the DedicatedThreadPool. The main improvement is a shared work queue for all workers, implemented with a ConcurrentQueue and a custom semaphore highly optimized for the thread-pool scenario. (I borrowed UnfairSemaphore from the coreclr repo; it's the one used by the real CLR ThreadPool.)

The new DedicatedThreadPool has ~3.5x more throughput and 32x lower latency (in the worst case).
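For context, the overall shape of that design is roughly the following. This is a simplified sketch, with SemaphoreSlim standing in for UnfairSemaphore; the real implementation differs in the semaphore, shutdown handling, and error handling.

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    internal sealed class SharedQueuePoolSketch
    {
        private readonly ConcurrentQueue<Action> _queue = new ConcurrentQueue<Action>();
        private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);

        public SharedQueuePoolSketch(int numThreads)
        {
            for (var i = 0; i < numThreads; i++)
            {
                new Thread(WorkerLoop) { IsBackground = true }.Start();
            }
        }

        public void QueueUserWorkItem(Action work)
        {
            _queue.Enqueue(work);       // single queue shared by every worker
            _semaphore.Release();       // one permit per item wakes at most one idle worker
        }

        private void WorkerLoop()
        {
            while (true)
            {
                _semaphore.Wait();      // park until an item is available
                Action work;
                if (_queue.TryDequeue(out work))
                {
                    work();
                }
            }
        }
    }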
Before the refactoring:
Throughput test (Queue 10 000 000 items and wait completion)
Global ThreadPool : 00:00:01.2247981
HeliosThreadPool : 00:00:03.8819794
Latency test (Sequentially queue 300 000 items)
Global ThreadPool : 00:00:00.4042989
HeliosThreadPool : 00:00:07.2341554
After the refactoring:
Throughput test (Queue 10 000 000 items and wait completion)
Global ThreadPool : 00:00:01.2171287
HeliosThreadPool : 00:00:01.0755531
Latency test (Sequentially queue 300 000 items)
Global ThreadPool : 00:00:00.5609632
HeliosThreadPool : 00:00:00.2240666
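For reference, a throughput measurement of the kind listed above ("queue N items and wait completion") can be taken roughly like this; the pool's QueueUserWorkItem signature is assumed from earlier in the thread, and the harness is illustrative rather than the one used for these numbers.

    using System;
    using System.Diagnostics;
    using System.Threading;

    internal static class ThroughputBenchSketch
    {
        public static TimeSpan Measure(DedicatedThreadPool pool, int itemCount)
        {
            using (var done = new CountdownEvent(itemCount))
            {
                var sw = Stopwatch.StartNew();
                for (var i = 0; i < itemCount; i++)
                {
                    pool.QueueUserWorkItem(() => done.Signal());
                }
                done.Wait();    // "wait completion" = every queued item has signaled
                sw.Stop();
                return sw.Elapsed;
            }
        }
    }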
I commented out the deadlock detection code; the option is disabled by default, and enabling it was extremely dangerous. Depending on your feedback, we can leave it disabled or find a better way to mitigate pool deadlocks.