-
Notifications
You must be signed in to change notification settings - Fork 52
Hetero neighbor sampler multithreading #215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #215 +/- ##
=======================================
Coverage 83.49% 83.49%
=======================================
Files 26 26
Lines 848 848
=======================================
Hits 708 708
Misses 140 140
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
78b60f8
to
8dd411b
Compare
The speedup does not see to be very much, do we have analysis on what is the bottleneck ? |
I think the reason is that the number of threads is limited to the number of dst nodes. As many different dst nodes there are as many threads. For the ogbn-mag dataset, this will be 4. |
8dd411b
to
930dc1a
Compare
The code generally LGTM, but i am still curious on the limited speedup. You may try from two aspects:
|
Sure, I'll check that in VTune. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. A few minor comments.
} | ||
at::parallel_for(0, node_types.size(), 1, [&](size_t _s, size_t _e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this bring any gain? We are not doing heavy work here, so the overhead of threading might not be worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check this, I made two measurements. The first in red are measurements made with the current code. And in green is after removing at::parallel_for (what was before). The results show that the introduction of parallelization at this point gives maybe not a big but still speedup (I took the measurements on a different machine than the one mentioned in the PR description, which is why the results are different):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But maybe to be sure, I'll take 10 measurements and average the results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be in favor of removing it, as it doesn't look that useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
930dc1a
to
6449aca
Compare
This reverts commit f1b10bb.
This reverts commit b58ee31.
The main roadblocks in the context of sampler parallelization is the mapper and the sampler. However, in the case of hetero sampling, there are separate mappers for each dst node type and separate samplers for each edge type. Therefore, to avoid race condition we can use parallelization per dst node type, i.e. each thread is assigned an edge types with unique dst node type that it is working on.
Example sampling times obtained using hetero_neighbor.py benchmark:
Example end-to-end times obtained using inference_benchmark.py: