Adding Gloo as a subtree #2196
Closed
Conversation
Summary: In the GitHub repository this directory will be mirrored similarly to folly, such that the repository has a single top-level directory called "gloo". This allows for versioning or renaming of the project root without having to mangle the include paths; they will always use the "gloo" prefix. fbshipit-source-id: 24502e4185fc7cbe19b5249f83609e2b8118e9d7
Summary: Testing pull request again. Closes pytorch/gloo#2 Reviewed By: pietern Differential Revision: D4542327 Pulled By: Yangqing fbshipit-source-id: 5bd66c32c7249f1327225117815bef64b8708722
Summary: The CUDA benchmark suite will be a separate build target, so the runner should be reused. Reviewed By: Yangqing Differential Revision: D4545092 fbshipit-source-id: 6ccf2d30f5d35c74fc59851b25416bfe6863d62c
Summary: This CUDA-aware ring allreduce is based on the regular ring allreduce. It runs the reduction algorithm on the CPU and is therefore most suited for smaller buffers. Both the device-to-host memcpy's at the start of the algorithm and the host-to-device memcpy's at the end of the algorithm are kicked off asynchronously in an attempt to parallelize as much as possible. Reviewed By: Yangqing Differential Revision: D4542816 fbshipit-source-id: 101dfad276ca79703e37ff93fb1b6d467295f66b
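For illustration, a minimal sketch of the staging pattern this commit describes, not Gloo's actual code: device inputs are copied to (assumed pinned) host buffers with asynchronous memcpys, reduced on the CPU, and the result is written back asynchronously. All helper names are hypothetical.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical sketch of the CPU-side local phase of a CUDA-aware ring
// allreduce: async D2H copies, CPU reduction, async H2D write-back.
void cpuSideAllreduceLocalPhase(
    float** deviceInputs, float** hostStaging, cudaStream_t* streams,
    int numInputs, size_t count, float* hostResult) {
  // Kick off all device-to-host copies asynchronously (overlaps transfers).
  for (int i = 0; i < numInputs; i++) {
    cudaMemcpyAsync(hostStaging[i], deviceInputs[i], count * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
  }
  // Wait for the copies, then reduce on the CPU.
  for (int i = 0; i < numInputs; i++) {
    cudaStreamSynchronize(streams[i]);
  }
  for (size_t j = 0; j < count; j++) {
    float sum = 0.0f;
    for (int i = 0; i < numInputs; i++) {
      sum += hostStaging[i][j];
    }
    hostResult[j] = sum;
  }
  // ... the ring exchange over the network would operate on hostResult ...
  // Finally, broadcast the reduced result back to every device buffer.
  for (int i = 0; i < numInputs; i++) {
    cudaMemcpyAsync(deviceInputs[i], hostResult, count * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
  }
}
```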
Summary: TSIA Reviewed By: plapukhov Differential Revision: D4549105 fbshipit-source-id: 61c8966e429e0701677f441aeaaf27fdc5e669e7
Summary: Separate benchmark build target for CUDA-aware algorithms. This is needed to keep CUDA an optional dependency. Differential Revision: D4546932 fbshipit-source-id: b73176ae9067233f883d51ba3ab4efbb13a6f86f
Summary: Implement CUDA BroadcastOneToAll algorithm for GPU addresses. Refactor cuda.h into cuda_private.h to allow inclusion of <cuda.h> in public headers without polluting the namespace. Port broadcast tests to GPU variants. * this revision is based on Peter's revision D4546932 Differential Revision: D4547382 fbshipit-source-id: 3d294ad8862b04fb783ba22e5c925b8d7cbc8a8d
Summary: In synchronous mode, it is not the device thread that is responsible for handling I/O, but the user thread itself. Calling waitRecv on a buffer will trigger the read function on the pair to be called. This eliminates the context switch necessary if the device thread is handling all I/O. For benchmarks with small numbers of elements this reduces latency by as much as 20%. Reviewed By: plapukhov Differential Revision: D4549998 fbshipit-source-id: ab718ba090c06d7c7aa4065cc9f92bd96b9e4a35
Summary: The CudaDevicePointer optionally takes an existing stream on which it runs any operation associated with the pointer (for now just memcpy's, but this will likely include kernel execution in the future). Differential Revision: D4574035 fbshipit-source-id: ddd7972a3874012059f1fde1b341fd6edd69102d
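An illustrative wrapper in the spirit of the CudaDevicePointer described above; the real class lives in Gloo, and the names and members here are assumptions for the sketch.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Sketch only: wraps a device pointer plus the stream on which all of its
// operations are queued. If no stream is supplied, a private one is created.
template <typename T>
class DevicePointerSketch {
 public:
  DevicePointerSketch(T* ptr, size_t count, cudaStream_t stream = nullptr)
      : ptr_(ptr), count_(count), stream_(stream) {
    if (stream_ == nullptr) {
      cudaStreamCreate(&stream_);
      ownsStream_ = true;
    }
  }

  ~DevicePointerSketch() {
    if (ownsStream_) {
      cudaStreamSynchronize(stream_);
      cudaStreamDestroy(stream_);
    }
  }

  // Operations tied to this pointer are queued on its stream.
  void copyToHostAsync(T* host) {
    cudaMemcpyAsync(host, ptr_, count_ * sizeof(T),
                    cudaMemcpyDeviceToHost, stream_);
  }

  void wait() { cudaStreamSynchronize(stream_); }

 private:
  T* ptr_;
  size_t count_;
  cudaStream_t stream_;
  bool ownsStream_ = false;
};
```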
Summary: Latency optimization is going well and I've seen the odd case of <10us measurements. This option makes the benchmark tool display nanos instead. Differential Revision: D4575925 fbshipit-source-id: 98dbd3b39e31cbcdd4c146613f6630e721187e1e
Summary: Ideally we would want the driver to busy-poll for us. In absence of driver support, spinning with MSG_DONTWAIT flag seems to be helping a lot too. Of course, we pay the price of burning one core for polling. Sigh. Reviewed By: pietern Differential Revision: D4576242 fbshipit-source-id: 85d9e1b786fbb6053864fba80f3e5ecc80fe221d
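A minimal sketch of the busy-poll idea: spin on a non-blocking recv with MSG_DONTWAIT instead of sleeping in epoll/poll, trading one busy core for lower wakeup latency. The helper name is hypothetical and error handling is trimmed.

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <cerrno>
#include <cstddef>

// Spin on the socket until data arrives; EAGAIN/EWOULDBLOCK means "nothing
// yet, try again", anything else is a real error for the caller to handle.
ssize_t busyPollRecv(int fd, void* buf, size_t len) {
  for (;;) {
    ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
    if (n >= 0) {
      return n; // got data (or peer closed the connection when n == 0)
    }
    if (errno != EAGAIN && errno != EWOULDBLOCK) {
      return -1; // real error; caller inspects errno
    }
    // Nothing available yet: keep spinning.
  }
}
```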
Summary: First pass at a CUDA-aware allreduce chunked implementation. For now the algorithm runs on the CPU and is mostly copy/paste from allreduce_ring.h. A subsequent pass will offload to the GPU. Serialize cuda test to avoid intermittent failures due to memory contention. Reviewed By: pietern Differential Revision: D4576959 fbshipit-source-id: e1f292a05b88ff24c33e549d4a52e770a21f85d2
Summary: I was mistakenly calling the non-chunked algorithm for the chunked test. Reviewed By: pietern Differential Revision: D4580160 fbshipit-source-id: 9d62a68e9e86cc6e596d90ff8854c585a0e8855c
Summary: Work may be queued on CUDA streams for asynchronous execution. The memory backed by pointers passed to any algorithm can therefore be mutated after constructing an algorithm instance. By also passing in the streams these mutations happen on, the algorithms can synchronize with these mutations to ensure no invalid data is used. By passing in these streams, any work done by these algorithms will *also* be queued, which effectively removes a single synchronization step from any algorithm run. Differential Revision: D4589394 fbshipit-source-id: 0c8cd6ba9c9018f33d6f4c55a037083fc4164acb
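A sketch of the stream-handoff pattern this commit describes, with assumed names: the algorithm makes its own stream wait on the caller's stream, so pending writes to the input buffers are visible before they are read, without a blocking host synchronization.

```cpp
#include <cuda_runtime.h>

// Record an event on the caller's stream and make the algorithm's stream
// wait on it; both streams keep running asynchronously with respect to the
// host thread.
void sequenceAfterCallerStream(cudaStream_t callerStream,
                               cudaStream_t algorithmStream) {
  cudaEvent_t ready;
  cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);
  // Mark the point on the caller's stream after which the inputs are valid.
  cudaEventRecord(ready, callerStream);
  // The algorithm's stream waits for that point without blocking the host.
  cudaStreamWaitEvent(algorithmStream, ready, 0);
  cudaEventDestroy(ready);
}
```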
Summary: TSIA Differential Revision: D4591755 fbshipit-source-id: fa435f4ad6b97453c3c9516b4bfc9f8f0fb2e4f1
Summary: Adds script to populate third-party directory. Differential Revision: D4591509 fbshipit-source-id: 28934feb536a9f3a066d8c40988337f3dddffaed
Summary: The AllReduceChunked algorithm currently performs the local reduce/broadcast of local device buffers in host memory. This diff updates the algorithm to execute the local reduce/broadcast steps using NCCL operations before copying a single device buffer to/from host memory. Reviewed By: pietern Differential Revision: D4587441 fbshipit-source-id: 4de689f59a6cf898b8eecd3c3b9f57f77124c0e3
Summary: Allow gloo consumers to assign a mutex to synchronize CUDA malloc/free and NCCL operations. Reviewed By: pietern Differential Revision: D4622135 fbshipit-source-id: 60acd7c01a677a0df5415fe38e6ef5a2e7c8606a
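A sketch of the opt-in mutex described above (names are assumptions, not Gloo's API): serializing cudaMalloc/cudaFree against NCCL launches avoids the interleaving deadlock when several processes share a GPU.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <mutex>

// Consumer may install a mutex shared with its own NCCL/allocation calls.
static std::mutex* cudaNcclMutex = nullptr;

void setCudaNcclMutex(std::mutex* m) { cudaNcclMutex = m; }

// Allocation guarded by the shared mutex, if one was provided.
void* guardedCudaMalloc(size_t bytes) {
  std::unique_lock<std::mutex> lock;
  if (cudaNcclMutex != nullptr) {
    lock = std::unique_lock<std::mutex>(*cudaNcclMutex);
  }
  void* ptr = nullptr;
  cudaMalloc(&ptr, bytes);
  return ptr;
}
```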
Summary: std::atomic was not defined for cuda.cu. Reviewed By: andrewwdye Differential Revision: D4624611 fbshipit-source-id: 973bba10026e065667d6a576055d00505ee02d62
Summary: TSIA Reviewed By: andrewwdye Differential Revision: D4626965 fbshipit-source-id: 2d32b07182202f65e673795aefacc6cc991d3c7c
Summary: All pairs created by a device would use the same completion queue. Supporting sync mode that way is difficult, as there is no way to filter completions for a particular pair. This change refactors this to use a single completion queue per pair so that this is no longer an issue. This change is a preparation for supporting synchronous mode (where the calling thread itself will poll the ibv library for completions instead of the device thread). This change also includes a refactoring of the way transient memory regions are handled so that they are properly deregistered and deallocated when no longer needed. Reviewed By: andrewwdye Differential Revision: D4625146 fbshipit-source-id: 21bf5ab321534fbd5c03f12049c10fc67da68944
Summary: Synchronous mode means using the calling thread instead of the device thread for completion handling. Since this saves a context switch in the critical path, this is very beneficial for low latency algorithms. For example: the p99 of a 4-way barrier drops from 17us to 4us. Reviewed By: andrewwdye Differential Revision: D4626948 fbshipit-source-id: 013b1680497589fe5ad0bca38600bce6a410200b
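A minimal sketch of synchronous completion handling with standard libibverbs calls (the helper name is hypothetical): the calling thread polls the pair's completion queue directly instead of waiting on a device thread, saving a context switch on the critical path.

```cpp
#include <infiniband/verbs.h>

// Spin on the completion queue until one work completion is available.
// Returns 0 on a successful completion, -1 on a polling error or a failed
// work request.
int pollForCompletion(struct ibv_cq* cq, struct ibv_wc* wc) {
  for (;;) {
    int n = ibv_poll_cq(cq, 1, wc);
    if (n < 0) {
      return -1; // polling error
    }
    if (n == 1) {
      return (wc->status == IBV_WC_SUCCESS) ? 0 : -1;
    }
    // n == 0: nothing completed yet, keep spinning.
  }
}
```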
Summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management. Reviewed By: pietern Differential Revision: D4633120 fbshipit-source-id: d88e230c8dc82fe48cda0f401b61758fa4f07f2e
Summary: With this change, every buffer gets assigned a different value at every index. This means reordering of segments (e.g. in the chunked algorithm) would surface as test errors. Reviewed By: andrewwdye Differential Revision: D4636368 fbshipit-source-id: 464eb1515d1590e12481961d427a92e2ebb3be82
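A sketch of the test fill pattern this commit describes; the stride constant and function name are arbitrary choices for illustration.

```cpp
#include <cstddef>

// Give every buffer a distinct value at every index, so any reordering of
// segments (e.g. in the chunked algorithm) produces a wrong value that the
// test can detect, rather than silently matching.
void fillDistinct(float* buf, size_t count, int rank, int bufferIndex) {
  const float stride = 1000.0f; // hypothetical spacing between buffers
  for (size_t i = 0; i < count; i++) {
    buf[i] = (rank * 8 + bufferIndex) * stride + static_cast<float>(i);
  }
}
```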
…ssed in Summary: Cuda algorithms take an optional set of device streams to sequence operations. If streams are provided, the algorithms should enqueue final output buffer operations on the associated stream and return asynchronously. Destructors that allocate streams/events should synchronize before tearing down. Reviewed By: pietern Differential Revision: D4636447 fbshipit-source-id: 32ec2adc214c83b0b4bc0fff8993ab196459117b
Summary: The NCCL code used in CUDA-aware allreduce does local reduction of N buffers prior to putting anything on the wire. Supporting this in the benchmark tool to measure the impact under various configurations. Other minor tweaks in this change: * Specify sub-second iteration time * Templatize allreduce benchmarks (the algorithms share a constructor prototype) Reviewed By: andrewwdye Differential Revision: D4639517 fbshipit-source-id: f7417d3e9f79278a3b1eca48d779f48b77e5260c
Summary: TSIA Reviewed By: andrewwdye Differential Revision: D4644734 fbshipit-source-id: 50f5fadd2c5cd04e06a025f5538187ed852e669a
Summary: Remove underscores from public fields in NCCLContext Reviewed By: pietern Differential Revision: D4645857 fbshipit-source-id: 2c28a1c23d31097d685c0768dad9b99bbef7b171
Summary: The fields are public so their names should not end with an underscore. Reviewed By: andrewwdye Differential Revision: D4645038 fbshipit-source-id: c12b47affbe511383a4722717a06abb61918473b
Summary: TSIA Reviewed By: andrewwdye Differential Revision: D4647587 fbshipit-source-id: a804e7479e6e2f511bfa59712b4b4a88bdf657e3
Summary: TSIA Reviewed By: romain-intel Differential Revision: D5158642 fbshipit-source-id: 6e55a69a140c1f5f6e4ce6262afaf5014c412414
Summary: Machines may not create their Gloo pairs at the same time, due to earlier variable time work. Increase the timeout used to establish the initial tcp connection to accommodate without sacrificing the shorter default timeout for outstanding reads/writes. No related change required for ibverbs as there is no communication on init. Reviewed By: akyrola Differential Revision: D5184518 fbshipit-source-id: 0e6c9704a2d2f1406b3927f75887f0a42199450b
Summary: While debugging #43 I found common/common.h missing some headers as well. Fixes #43. Closes pytorch/gloo#44 Differential Revision: D5194970 Pulled By: pietern fbshipit-source-id: 4861cd04c56931d4759f5bc050816788252003ee
Fix NCCL directory typo
Summary: Replace call to function that is only supported in CUDA 8.0 with one that has been supported in previous releases. Reviewed By: pietern Differential Revision: D5231755 fbshipit-source-id: d72aec2a4a1c511064a65142887f8a05b51dad55
Summary: \cc pietern Minimal changes to allow gloo to compile and run with NCCL 2.0 Closes pytorch/gloo#46 Differential Revision: D5268074 Pulled By: pietern fbshipit-source-id: 58d625d57b31cfc932f3dbbdd7a4b83d9a2e60a8
Summary: This changes prepares for having a separate set of collectives that use native CUDA calls instead of NCCL. This is needed to workaround the issue where NCCL deadlocks when it is interleaved with CUDA memory management operations in other processes on the same machine. Includes a modification to the host reduction functions to bring them up to parity with the NCCL reduction functions (they now incorporate offset/counter arguments). Reviewed By: wesolwsk Differential Revision: D5276291 fbshipit-source-id: 8844731760d2c48577d207c026ce0cd641f2fc6d
Summary: Previously, `gloo/math.h` inlined methods which use AVX builtins, which required propagating the `-mavx` flag. This diff moves these definitions out of the header and into a source file to avoid this. Reviewed By: pixelb Differential Revision: D5271043 fbshipit-source-id: dde4dc560dfb557b46d1a582a8b38e7cb8eb0c37
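A sketch of the header/source split described above (function name and layout are assumptions): the header exposes only a declaration, and the AVX body lives in one translation unit compiled with `-mavx`, so includers do not need the flag.

```cpp
// math.h (sketch): declaration only, no AVX builtins visible to includers.
//   void sum(float* x, const float* y, size_t n);

// math.cc (sketch): compiled with -mavx in this one translation unit.
#include <immintrin.h>
#include <cstddef>

void sum(float* x, const float* y, size_t n) {
  size_t i = 0;
  // Vectorized body: add 8 floats at a time with unaligned loads/stores.
  for (; i + 8 <= n; i += 8) {
    __m256 a = _mm256_loadu_ps(x + i);
    __m256 b = _mm256_loadu_ps(y + i);
    _mm256_storeu_ps(x + i, _mm256_add_ps(a, b));
  }
  // Scalar tail for the remaining elements.
  for (; i < n; i++) {
    x[i] += y[i];
  }
}
```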
Summary: Code in tcp/transport tries to find the network interface a socket was bound to when creating a TCP device context. Per getifaddrs(3), it is possible for the ifa_addr field to be NULL (supposedly when an interface doesn't have an address). Ignore such entries. Thanks to slayton58 for reporting this. Reviewed By: wesolwsk Differential Revision: D5279376 fbshipit-source-id: 039380b95ba4d6d94942c30581e0b230a060870c
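A sketch of the defensive lookup (hypothetical helper, not Gloo's code): walk the getifaddrs(3) list and skip entries whose ifa_addr is NULL, which is exactly the case the fix guards against.

```cpp
#include <ifaddrs.h>
#include <netinet/in.h>
#include <string>

// Find the name of the interface whose IPv4 address matches the address a
// socket was bound to. Entries without an address are ignored.
std::string findInterfaceForAddress(const struct sockaddr_in* bound) {
  struct ifaddrs* list = nullptr;
  std::string name;
  if (getifaddrs(&list) != 0) {
    return name;
  }
  for (struct ifaddrs* ifa = list; ifa != nullptr; ifa = ifa->ifa_next) {
    if (ifa->ifa_addr == nullptr || ifa->ifa_addr->sa_family != AF_INET) {
      continue; // no address on this interface, or not IPv4
    }
    auto* addr = reinterpret_cast<struct sockaddr_in*>(ifa->ifa_addr);
    if (addr->sin_addr.s_addr == bound->sin_addr.s_addr) {
      name = ifa->ifa_name;
      break;
    }
  }
  freeifaddrs(list);
  return name;
}
```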
Summary: Adds a separate set of CUDA collectives that run on device as an alternative to NCCL, and uses them as the default on-device collectives. Whenever multiple processes on the same machine use Gloo with NCCL and end up doing concurrent CUDA memory allocations and algorithm execution, we risk deadlock. A follow-up change will enable opt-in usage of NCCL (e.g. through an environment variable). Benchmark output below, with varying numbers of elements, shows a minor improvement over using NCCL for local reduction and broadcast.

Number of elements equal to the on-device threshold (256K):

```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring
Options: processes=2, inputs=8, gpudirect=no

           elements  min (us)  p50 (us)  p99 (us)  max (us)  samples
(before)     262144      2685      2907      3035      3215      562
(after)      262144      2682      2874      3013      3395      577

Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring_chunked
Options: processes=2, inputs=8, gpudirect=no

           elements  min (us)  p50 (us)  p99 (us)  max (us)  samples
(before)     262144      2045      2133      2325      2643      725
(after)      262144      1533      1673      1834      2048      800

Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_halving_doubling
Options: processes=2, inputs=8, gpudirect=no

           elements  min (us)  p50 (us)  p99 (us)  max (us)  samples
(before)     262144      1580      1640      1718      2069      893
(after)      262144      1371      1446      1539      1748     1125
```

Larger number of elements (4M):

```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring
Options: processes=2, inputs=8, gpudirect=no

           elements  min (us)  p50 (us)  p99 (us)  max (us)  samples
(before)    4194304     55543     58058     60103     62659       32
(after)     4194304     54490     57923     60893     66058       33

Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring_chunked
Options: processes=2, inputs=8, gpudirect=no

           elements  min (us)  p50 (us)  p99 (us)  max (us)  samples
(before)    4194304     18049     22820     24997     26634      105
(after)     4194304     18356     20463     21695     22589       99

Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_halving_doubling
Options: processes=2, inputs=8, gpudirect=no

           elements  min (us)  p50 (us)  p99 (us)  max (us)  samples
(before)    4194304     18584     24345     27809     29722       95
(after)     4194304     19541     22718     25408     26688       88
```

Reviewed By: akyrola Differential Revision: D5278192 fbshipit-source-id: 53f09e404663ddc8bb46d06ac87afd8ee3ffc3a2
Summary: Closes pytorch/gloo#47 Differential Revision: D5283752 Pulled By: pietern fbshipit-source-id: 8ad3353b3455c5416e31e75b46755e2f7fcaad52
Summary: Adds basic CUDA 9 support, including adding Volta arch, and making appropriate modifications for half precision datatype changes Closes pytorch/gloo#49 Differential Revision: D5315336 Pulled By: pietern fbshipit-source-id: 6468b0f357206d604bdcfec69ba82509a2c91407
Summary: A simple benchmark to determine network bandwidth for pairwise communication. Reviewed By: plapukhov Differential Revision: D5159607 fbshipit-source-id: d16c3ed3a0c2ae182138df91bdae821f5508c6ac
Summary: Use the CreateCommonWorld timeout for the storehandler as well, not just the device connect. Reviewed By: andrewwdye Differential Revision: D5425923 fbshipit-source-id: 936d2129e2db3bfed8759ca097b75843d3931d5f
Summary: CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera. The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures. Reviewed By: Orvid Differential Revision: D5432398 Tags: codemod, codemod-opensource fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914
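An illustrative example of the codemod's before/after pattern in GoogleTest; the test name and values are made up for the sketch.

```cpp
#include <gtest/gtest.h>

TEST(CodemodExample, UnreachableBranch) {
  int value = 2;
  switch (value) {
    case 1:
    case 2:
      SUCCEED();
      break;
    default:
      // Before: EXPECT_TRUE(false) << "unexpected value";
      // After: an unconditional failure states the intent directly.
      ADD_FAILURE() << "unexpected value: " << value;
  }
}
```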
Summary: To reduce round trips with store handlers, it is better to store all addresses in one key instead of one address per pair. This is what this implements. Reviewed By: andrewwdye Differential Revision: D5435893 fbshipit-source-id: 2d3ea3a2822c3b934ff2578d44a262e7bfbde6d0
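A sketch of the packing idea only; the length-prefixed format below is an assumption for illustration, not Gloo's wire format. Concatenating all pair addresses into one blob lets a single store write/read replace one round trip per pair.

```cpp
#include <cstdint>
#include <vector>

// Pack N opaque pair addresses into one blob: [len][bytes][len][bytes]...
std::vector<char> packAddresses(const std::vector<std::vector<char>>& addrs) {
  std::vector<char> blob;
  for (const auto& a : addrs) {
    uint32_t len = static_cast<uint32_t>(a.size());
    const char* lenBytes = reinterpret_cast<const char*>(&len);
    blob.insert(blob.end(), lenBytes, lenBytes + sizeof(len));
    blob.insert(blob.end(), a.begin(), a.end());
  }
  return blob;
}
```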
Summary: When compiling with -Werror=shadow-compatible-local, a variable name cannot be reused in a nested scope. This passed our tests, but some people compile with stricter settings. Differential Revision: D5440805 fbshipit-source-id: a246af748717fb7e0e7a321e1ac4ddfef68ae524
…igned buffers Summary: When performing reductions on fp16 buffers, gloo assumed that both buffers were either aligned to 32 bytes or misaligned by the same offset. This may not hold in intermediate steps of halving-doubling allreduce, when the reduction is performed on some offset within the receive buffer. The fix is to use intrinsic instructions that work with unaligned pointers. Reviewed By: akyrola Differential Revision: D5450103 fbshipit-source-id: 9a1c8f8c34d2e62223f6d5c21573ea1cfad6537f
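A sketch of the unaligned-safe fp16 reduction step described above (helper name is hypothetical; requires AVX and F16C): `loadu`/`storeu` variants plus the F16C conversions tolerate pointers at arbitrary offsets, unlike aligned loads that misbehave when the two buffers are misaligned relative to each other.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Add 8 half-precision values at a time: unaligned loads, convert to fp32,
// add, convert back to fp16, unaligned store. Compile with -mavx -mf16c.
void halfSumUnaligned(uint16_t* x, const uint16_t* y, size_t n) {
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m128i xh = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
    __m128i yh = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
    __m256 xf = _mm256_cvtph_ps(xh);
    __m256 yf = _mm256_cvtph_ps(yh);
    __m128i r = _mm256_cvtps_ph(_mm256_add_ps(xf, yf),
                                _MM_FROUND_TO_NEAREST_INT);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(x + i), r);
  }
  // A scalar tail for the remaining elements is omitted in this sketch.
}
```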
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
houseroad added a commit to houseroad/pytorch that referenced this pull request on Jul 25, 2019
…9a6052 Summary: Previous import was 707064980b9825b8705b9d1c9aad34d8b022d5dd Included changes: - **[28ca699b](onnx/onnx@28ca699b)**: Member Company logo guidelines (pytorch#2196) <Prasanth Pulavarthi> - **[47acb06a](onnx/onnx@47acb06a)**: remove link to outdated issue for contributions wanted (pytorch#2186) <Prasanth Pulavarthi> - **[168519f6](onnx/onnx@168519f6)**: Create sigs.md (pytorch#2103) <Prasanth Pulavarthi> - **[b9320746](onnx/onnx@b9320746)**: mintor format update (pytorch#2180) <Prasanth Pulavarthi> - **[65b8e0f9](onnx/onnx@65b8e0f9)**: add more types support for Equal op (pytorch#2176) <Ke Zhang> - **[dc5e62a9](onnx/onnx@dc5e62a9)**: Update AddNewOP document. (pytorch#2172) <Emad Barsoum> - **[bae8b530](onnx/onnx@bae8b530)**: Add missing space (pytorch#2150) <Takeshi Watanabe> - **[5952b7f5](onnx/onnx@5952b7f5)**: python api example typo fix (pytorch#2155) <LeicongLi> - **[904cb842](onnx/onnx@904cb842)**: Fix errors in RoiAlign shape inference code (pytorch#2167) <G. Ramalingam> Differential Revision: D16502373 fbshipit-source-id: 68b9479a30fc330d876947cb4ea8227848f576e3
[DO NOT MERGE], only for contbuild