
Conversation

soumith
Member

@soumith soumith commented Jul 24, 2017

[DO NOT MERGE], only for contbuild

pietern and others added 30 commits February 9, 2017 12:33
Summary:
In the GitHub repository this directory will be mirrored similarly to
folly, such that the repository has a single top level directory
called "gloo". This allows for versioning or renaming of the
project root, without having to mangle the include paths; they will
always use the "gloo" prefix.

fbshipit-source-id: 24502e4185fc7cbe19b5249f83609e2b8118e9d7
Summary:
Testing pull request again.
Closes pytorch/gloo#2

Reviewed By: pietern

Differential Revision: D4542327

Pulled By: Yangqing

fbshipit-source-id: 5bd66c32c7249f1327225117815bef64b8708722
Summary:
The CUDA benchmark suite will be a separate build target, so the
runner should be reused.

Reviewed By: Yangqing

Differential Revision: D4545092

fbshipit-source-id: 6ccf2d30f5d35c74fc59851b25416bfe6863d62c
Summary:
This CUDA-aware ring allreduce is based on the regular ring allreduce.
It runs the reduction algorithm on the CPU and is therefore most
suited for smaller buffers.

Both the device-to-host memcpy's at the start of the algorithm and the
host-to-device memcpy's at the end of the algorithm are kicked off
asynchronously in an attempt to parallelize as much as possible.

Reviewed By: Yangqing

Differential Revision: D4542816

fbshipit-source-id: 101dfad276ca79703e37ff93fb1b6d467295f66b
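
A minimal sketch of the copy overlap described in this commit, assuming one caller-visible CUDA stream per input buffer (names here are illustrative, not gloo's actual API):

```
#include <cuda_runtime.h>
#include <vector>

// Kick off device-to-host copies for all input buffers at once, then wait
// for all of them before the CPU reduction starts. The same pattern runs
// in reverse for the host-to-device copies at the end of the algorithm.
void copyToHostAsync(const std::vector<float*>& devicePtrs,
                     const std::vector<float*>& hostPtrs,
                     const std::vector<cudaStream_t>& streams,
                     size_t count) {
  for (size_t i = 0; i < devicePtrs.size(); i++) {
    cudaMemcpyAsync(hostPtrs[i], devicePtrs[i], count * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
  }
  for (auto stream : streams) {
    cudaStreamSynchronize(stream);
  }
}
```
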
Summary: TSIA

Reviewed By: plapukhov

Differential Revision: D4549105

fbshipit-source-id: 61c8966e429e0701677f441aeaaf27fdc5e669e7
Summary:
Separate benchmark build target for CUDA-aware algorithms.

This is needed to keep CUDA an optional dependency.

Differential Revision: D4546932

fbshipit-source-id: b73176ae9067233f883d51ba3ab4efbb13a6f86f
Summary:
Implement CUDA BroadcastOneToAll algorithm for GPU addresses. Refactor cuda.h into cuda_private.h to allow inclusion of <cuda.h> in public headers without polluting the namespace.

Port broadcast tests to GPU variants.

* this revision is based on Peter's revision D4546932

Differential Revision: D4547382

fbshipit-source-id: 3d294ad8862b04fb783ba22e5c925b8d7cbc8a8d
Summary:
In synchronous mode, it is not the device thread that is responsible
for handling I/O, but the user thread itself. Calling waitRecv on a
buffer will trigger the read function on the pair to be called. This
eliminates the context switch necessary if the device thread is
handling all I/O. For benchmarks with small numbers of elements this
reduces latency by as much as 20%.

Reviewed By: plapukhov

Differential Revision: D4549998

fbshipit-source-id: ab718ba090c06d7c7aa4065cc9f92bd96b9e4a35
Summary:
The CudaDevicePointer optionally takes an existing stream on
which it runs any operation associated with the pointer (for now just
memcpy's, but this will likely include kernel execution in the
future).

Differential Revision: D4574035

fbshipit-source-id: ddd7972a3874012059f1fde1b341fd6edd69102d
Summary:
Latency optimization is going well and I've seen the odd case of <10us
measurements. This option makes the benchmark tool display nanos
instead.

Differential Revision: D4575925

fbshipit-source-id: 98dbd3b39e31cbcdd4c146613f6630e721187e1e
Summary: Ideally we would want the driver to busy-poll for us. In absence of driver support, spinning with MSG_DONTWAIT flag seems to be helping a lot too. Of course, we pay the price of burning one core for polling. Sigh.

Reviewed By: pietern

Differential Revision: D4576242

fbshipit-source-id: 85d9e1b786fbb6053864fba80f3e5ecc80fe221d
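
The busy-poll idea sketched, assuming a connected socket and leaving real error handling to the caller (illustrative, not the actual transport code):

```
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>

// Busy-poll a socket instead of sleeping in poll/epoll. EAGAIN/EWOULDBLOCK
// just means "nothing yet", so spin and try again; this trades one core
// for lower wakeup latency.
ssize_t spinRecv(int fd, void* buf, size_t len) {
  for (;;) {
    ssize_t rv = recv(fd, buf, len, MSG_DONTWAIT);
    if (rv >= 0) {
      return rv;
    }
    if (errno != EAGAIN && errno != EWOULDBLOCK) {
      return -1;  // real error; caller decides what to do
    }
  }
}
```
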
Summary:
First pass at a CUDA-aware allreduce chunked implementation. For now the algorithm runs on the CPU and is mostly copy/paste from allreduce_ring.h. A subsequent pass will offload to the GPU.

Serialize cuda test to avoid intermittent failures due to memory contention.

Reviewed By: pietern

Differential Revision: D4576959

fbshipit-source-id: e1f292a05b88ff24c33e549d4a52e770a21f85d2
Summary: I was mistakenly calling the non-chunked algorithm for the chunked test.

Reviewed By: pietern

Differential Revision: D4580160

fbshipit-source-id: 9d62a68e9e86cc6e596d90ff8854c585a0e8855c
Summary:
Work may be queued on CUDA streams for asynchronous execution. The
memory backed by pointers passed to any algorithm can therefore be
mutated after constructing an algorithm instance. By also passing in
the streams these mutations happen on, the algorithms can synchronize
with these mutations to ensure no invalid data is used.

By passing in these streams, any work done by these algorithms will
*also* be queued on them, which effectively removes a single
synchronization step from any algorithm run.

Differential Revision: D4589394

fbshipit-source-id: 0c8cd6ba9c9018f33d6f4c55a037083fc4164acb
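
One way to picture the stream handoff, as a sketch rather than gloo's actual interface: the algorithm orders its internal stream after whatever the caller already queued, so no host-side synchronization is needed up front.

```
#include <cuda_runtime.h>

// Make the algorithm's internal stream wait for pending work on the
// caller's stream, without blocking the host thread.
void sequenceAfterCaller(cudaStream_t callerStream, cudaStream_t algoStream) {
  cudaEvent_t event;
  cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
  cudaEventRecord(event, callerStream);       // marks the caller's pending work
  cudaStreamWaitEvent(algoStream, event, 0);  // the wait happens on the GPU
  cudaEventDestroy(event);  // safe: resources are released once the event completes
}
```
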
Summary: TSIA

Differential Revision: D4591755

fbshipit-source-id: fa435f4ad6b97453c3c9516b4bfc9f8f0fb2e4f1
Summary: Adds script to populate third-party directory.

Differential Revision: D4591509

fbshipit-source-id: 28934feb536a9f3a066d8c40988337f3dddffaed
Summary: The AllReduceChunked algorithm currently performs the local reduce/broadcast of local device buffers in host memory. This diff updates the algorithm to execute the local reduce/broadcast steps using NCCL operations before copying a single device buffer to/from host memory.

Reviewed By: pietern

Differential Revision: D4587441

fbshipit-source-id: 4de689f59a6cf898b8eecd3c3b9f57f77124c0e3
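
Roughly, the local step becomes an NCCL reduce into one device buffer followed by a single copy to the host. The sketch below assumes one NCCL communicator and stream per local device, set up elsewhere, and NCCL 2-style grouped calls:

```
#include <cstddef>
#include <cuda_runtime.h>
#include <nccl.h>

// Reduce N local device buffers into the first one with NCCL, then copy
// just that buffer to host memory (instead of N copies plus a CPU reduce).
void localReduceThenCopy(float** deviceBufs, float* hostBuf, size_t count,
                         int numDevices, ncclComm_t* comms,
                         cudaStream_t* streams) {
  ncclGroupStart();
  for (int i = 0; i < numDevices; i++) {
    cudaSetDevice(i);
    ncclReduce(deviceBufs[i], deviceBufs[0], count, ncclFloat, ncclSum,
               /*root=*/0, comms[i], streams[i]);
  }
  ncclGroupEnd();
  cudaSetDevice(0);
  cudaMemcpyAsync(hostBuf, deviceBufs[0], count * sizeof(float),
                  cudaMemcpyDeviceToHost, streams[0]);
  cudaStreamSynchronize(streams[0]);
}
```
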
Summary: Allow gloo consumers to assign a mutex to synchronize CUDA malloc/free and NCCL operations.

Reviewed By: pietern

Differential Revision: D4622135

fbshipit-source-id: 60acd7c01a677a0df5415fe38e6ef5a2e7c8606a
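
What the hook enables, in sketch form (the mutex accessor here is hypothetical; gloo exposes its own way to install the mutex): every CUDA allocation/free and every NCCL launch takes the same lock, so the two never interleave across threads.

```
#include <cuda_runtime.h>
#include <mutex>

// Process-wide mutex shared by code that allocates/frees CUDA memory and
// code that launches NCCL collectives.
std::mutex& cudaNcclMutex() {
  static std::mutex m;
  return m;
}

float* guardedAlloc(size_t count) {
  std::lock_guard<std::mutex> guard(cudaNcclMutex());
  float* ptr = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&ptr), count * sizeof(float));
  return ptr;
}

// NCCL launches take the same lock before calling into the library:
//   std::lock_guard<std::mutex> guard(cudaNcclMutex());
//   ncclAllReduce(...);
```
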
Summary: std::atomic was not defined for cuda.cu.

Reviewed By: andrewwdye

Differential Revision: D4624611

fbshipit-source-id: 973bba10026e065667d6a576055d00505ee02d62
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4626965

fbshipit-source-id: 2d32b07182202f65e673795aefacc6cc991d3c7c
Summary:
All pairs created by a device would use the same completion queue.
Supporting sync mode that way is difficult, as there is no way to
filter completions for a particular pair. This change refactors this
to use a single completion queue per pair so that this is no longer an
issue. This change is a preparation for supporting synchronous mode
(where the calling thread itself will poll the ibv library for
completions instead of the device thread).

This change also includes a refactoring of the way transient memory
regions are handled so that they are properly deregistered and
deallocated when no longer needed.

Reviewed By: andrewwdye

Differential Revision: D4625146

fbshipit-source-id: 21bf5ab321534fbd5c03f12049c10fc67da68944
Summary:
Synchronous mode means using the calling thread instead of the device
thread for completion handling. Since this saves a context switch in
the critical path, this is very beneficial for low latency algorithms.

For example: the p99 of a 4-way barrier drops from 17us to 4us.

Reviewed By: andrewwdye

Differential Revision: D4626948

fbshipit-source-id: 013b1680497589fe5ad0bca38600bce6a410200b
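
Sync mode roughly amounts to the calling thread polling the pair's completion queue itself, along these lines (a sketch; the real code also handles errors and rearming):

```
#include <infiniband/verbs.h>

// Poll the pair's completion queue from the calling thread until a work
// completion shows up, instead of waiting for the device thread to be
// scheduled and hand it over.
ibv_wc pollOneCompletion(ibv_cq* cq) {
  ibv_wc wc;
  for (;;) {
    int n = ibv_poll_cq(cq, 1, &wc);
    if (n > 0) {
      return wc;  // caller inspects wc.status / wc.wr_id
    }
    // n == 0: nothing yet, keep spinning; n < 0 would be an error (omitted)
  }
}
```
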
Summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management.

Reviewed By: pietern

Differential Revision: D4633120

fbshipit-source-id: d88e230c8dc82fe48cda0f401b61758fa4f07f2e
Summary:
With this change, every buffer gets assigned a different
value at every index. This means reordering of segments (e.g. in the
chunked algorithm) would surface as test errors.

Reviewed By: andrewwdye

Differential Revision: D4636368

fbshipit-source-id: 464eb1515d1590e12481961d427a92e2ebb3be82
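
For example, a fixture along these lines gives every (buffer, index) pair a unique value, so a reordered segment cannot produce the same reduced result (hypothetical helper, not the actual test code):

```
#include <vector>

// Fill each input buffer so the value depends on both the buffer and the
// index; any reordering of segments then changes the reduced result.
std::vector<std::vector<float>> makeInputs(int numBuffers, int count) {
  std::vector<std::vector<float>> bufs(numBuffers);
  for (int b = 0; b < numBuffers; b++) {
    bufs[b].resize(count);
    for (int i = 0; i < count; i++) {
      bufs[b][i] = static_cast<float>(b * count + i);
    }
  }
  return bufs;
}
```
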
…ssed in

Summary: Cuda algorithms take an optional set of device streams to sequence operations. If streams are provided, the algorithms should enqueue final output buffer operations on the associated stream and return asynchronously. Destructors that allocate streams/events should synchronize before tearing down.

Reviewed By: pietern

Differential Revision: D4636447

fbshipit-source-id: 32ec2adc214c83b0b4bc0fff8993ab196459117b
Summary:
The NCCL code used in CUDA-aware allreduce does local reduction of N
buffers prior to putting anything on the wire. This change adds support
for it in the benchmark tool to measure the impact under various
configurations.

Other minor tweaks in this change:
* Specify sub-second iteration time
* Templatize allreduce benchmarks (the algorithms share a constructor
  prototype)

Reviewed By: andrewwdye

Differential Revision: D4639517

fbshipit-source-id: f7417d3e9f79278a3b1eca48d779f48b77e5260c
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4644734

fbshipit-source-id: 50f5fadd2c5cd04e06a025f5538187ed852e669a
Summary: Remove underscores from public fields in NCCLContext

Reviewed By: pietern

Differential Revision: D4645857

fbshipit-source-id: 2c28a1c23d31097d685c0768dad9b99bbef7b171
Summary:
The fields are public so their names should not end with an
underscore.

Reviewed By: andrewwdye

Differential Revision: D4645038

fbshipit-source-id: c12b47affbe511383a4722717a06abb61918473b
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4647587

fbshipit-source-id: a804e7479e6e2f511bfa59712b4b4a88bdf657e3
pietern and others added 26 commits May 31, 2017 19:50
Summary: TSIA

Reviewed By: romain-intel

Differential Revision: D5158642

fbshipit-source-id: 6e55a69a140c1f5f6e4ce6262afaf5014c412414
Summary: Machines may not create their Gloo pairs at the same time, due to earlier variable-time work. Increase the timeout used to establish the initial TCP connection to accommodate this, without sacrificing the shorter default timeout for outstanding reads/writes. No related change is required for ibverbs, as there is no communication on init.

Reviewed By: akyrola

Differential Revision: D5184518

fbshipit-source-id: 0e6c9704a2d2f1406b3927f75887f0a42199450b
Summary:
While debugging #43 I found common/common.h missing some headers as well.

Fixes #43.
Closes pytorch/gloo#44

Differential Revision: D5194970

Pulled By: pietern

fbshipit-source-id: 4861cd04c56931d4759f5bc050816788252003ee
Summary: Replace call to function that is only supported in CUDA 8.0 with one that has been supported in previous releases.

Reviewed By: pietern

Differential Revision: D5231755

fbshipit-source-id: d72aec2a4a1c511064a65142887f8a05b51dad55
Summary:
\cc pietern
Minimal changes to allow gloo to compile and run with NCCL 2.0
Closes pytorch/gloo#46

Differential Revision: D5268074

Pulled By: pietern

fbshipit-source-id: 58d625d57b31cfc932f3dbbdd7a4b83d9a2e60a8
Summary:
This change prepares for having a separate set of collectives that
use native CUDA calls instead of NCCL. This is needed to work around
the issue where NCCL deadlocks when it is interleaved with CUDA memory
management operations in other processes on the same machine.

Includes a modification to the host reduction functions to bring them
up to parity with the NCCL reduction functions (they now incorporate
offset/counter arguments).

Reviewed By: wesolwsk

Differential Revision: D5276291

fbshipit-source-id: 8844731760d2c48577d207c026ce0cd641f2fc6d
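
The offset/count parity mentioned above means the host reduction helpers take a shape roughly like the following (a sketch; the real signatures may differ):

```
#include <cstddef>

// Reduce a sub-range of src into dst, mirroring the NCCL-style
// (offset, count) arguments so callers can reduce partial segments.
template <typename T>
void sum(T* dst, const T* src, size_t offset, size_t count) {
  for (size_t i = 0; i < count; i++) {
    dst[offset + i] += src[offset + i];
  }
}
```
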
Summary:
Previously, `gloo/math.h` inlined methods which use AVX builtins,
which required propagating the `-mavx` flag.
This diff moves these definitions out of the header and into a source
file to avoid this.

Reviewed By: pixelb

Differential Revision: D5271043

fbshipit-source-id: dde4dc560dfb557b46d1a582a8b38e7cb8eb0c37
Summary:
Code in tcp/transport tries to find the network interface a socket was
bound to when creating a TCP device context. Per getifaddrs(3), it is
possible for the ifa_addr field to be NULL (supposedly when an
interface doesn't have an address). Ignore such entries.

Thanks to slayton58 for reporting this.

Reviewed By: wesolwsk

Differential Revision: D5279376

fbshipit-source-id: 039380b95ba4d6d94942c30581e0b230a060870c
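
The defensive check amounts to skipping NULL ifa_addr entries while walking the interface list, roughly (illustrative; the real code also matches the socket's bound address):

```
#include <ifaddrs.h>
#include <string>

// Walk all interfaces and skip entries whose ifa_addr is NULL, which
// getifaddrs(3) allows for interfaces without an assigned address.
std::string firstInterfaceWithAddress() {
  struct ifaddrs* list = nullptr;
  if (getifaddrs(&list) != 0) {
    return "";
  }
  std::string name;
  for (struct ifaddrs* ifa = list; ifa != nullptr; ifa = ifa->ifa_next) {
    if (ifa->ifa_addr == nullptr) {
      continue;  // interface without an address; ignore it
    }
    name = ifa->ifa_name;
    break;
  }
  freeifaddrs(list);
  return name;
}
```
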
Summary:
Adds a separate set of CUDA collectives that run on device as an
alternative to NCCL. Use these collectives as default on-device
collectives instead of NCCL.

Whenever multiple processes on the same machine use Gloo with NCCL and
end up doing concurrent CUDA memory allocations and algorithm
execution, we risk deadlock. A follow up change will enable opt-in
usage of NCCL (e.g. through environment variable).

Benchmark output below with varying number of elements. It shows a
minor improvement over using NCCL for local reduction and broadcast.

Number of elements equal to on-device threshold (256K):

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       2685       2907       3035       3215        562
(after)   262144       2682       2874       3013       3395        577

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring_chunked
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       2045       2133       2325       2643        725
(after)   262144       1533       1673       1834       2048        800

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_halving_doubling
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       1580       1640       1718       2069        893
(after)   262144       1371       1446       1539       1748       1125
```

Larger number of elements (4M):

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      55543      58058      60103      62659         32
(after)  4194304      54490      57923      60893      66058         33

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring_chunked
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      18049      22820      24997      26634        105
(after)  4194304      18356      20463      21695      22589         99

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_halving_doubling
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      18584      24345      27809      29722         95
(after)  4194304      19541      22718      25408      26688         88
```

Reviewed By: akyrola

Differential Revision: D5278192

fbshipit-source-id: 53f09e404663ddc8bb46d06ac87afd8ee3ffc3a2
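
The NCCL-free local broadcast can be as simple as device-to-device copies queued on per-buffer streams, sketched below under the assumption that the buffers live on the same device or are peer-accessible:

```
#include <cuda_runtime.h>
#include <vector>

// Broadcast the contents of bufs[0] to every other local device buffer
// using plain CUDA copies, avoiding NCCL kernels entirely.
void localBroadcast(const std::vector<float*>& bufs,
                    const std::vector<cudaStream_t>& streams,
                    size_t count) {
  for (size_t i = 1; i < bufs.size(); i++) {
    cudaMemcpyAsync(bufs[i], bufs[0], count * sizeof(float),
                    cudaMemcpyDeviceToDevice, streams[i]);
  }
  for (size_t i = 1; i < bufs.size(); i++) {
    cudaStreamSynchronize(streams[i]);
  }
}
```
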
Summary: Closes pytorch/gloo#47

Differential Revision: D5283752

Pulled By: pietern

fbshipit-source-id: 8ad3353b3455c5416e31e75b46755e2f7fcaad52
Summary:
Adds basic CUDA 9 support, including adding Volta arch, and making appropriate modifications for half precision datatype changes
Closes pytorch/gloo#49

Differential Revision: D5315336

Pulled By: pietern

fbshipit-source-id: 6468b0f357206d604bdcfec69ba82509a2c91407
Summary: A simple benchmark to determine network bandwidth for pairwise communication.

Reviewed By: plapukhov

Differential Revision: D5159607

fbshipit-source-id: d16c3ed3a0c2ae182138df91bdae821f5508c6ac
Summary: Use the CreateCommonWorld timeout for the storehandler as well, not just the device connect.

Reviewed By: andrewwdye

Differential Revision: D5425923

fbshipit-source-id: 936d2129e2db3bfed8759ca097b75843d3931d5f
Summary:
CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera.

The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures.

Reviewed By: Orvid

Differential Revision: D5432398

Tags: codemod, codemod-opensource

fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914
Summary: To reduce round trips with store handlers, it is better to store all addresses in one key instead of one address per pair. This change implements that.

Reviewed By: andrewwdye

Differential Revision: D5435893

fbshipit-source-id: 2d3ea3a2822c3b934ff2578d44a262e7bfbde6d0
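
Conceptually, the batching packs every pair's address into one length-prefixed blob stored under a single key (hypothetical helper, not gloo's actual wire format):

```
#include <cstdint>
#include <vector>

// Pack N per-pair addresses into one value: [len0][bytes0][len1][bytes1]...
// so a single store set/get replaces one round trip per pair.
std::vector<char> packAddresses(const std::vector<std::vector<char>>& addrs) {
  std::vector<char> blob;
  for (const auto& addr : addrs) {
    uint32_t len = static_cast<uint32_t>(addr.size());
    const char* lenBytes = reinterpret_cast<const char*>(&len);
    blob.insert(blob.end(), lenBytes, lenBytes + sizeof(len));
    blob.insert(blob.end(), addr.begin(), addr.end());
  }
  return blob;
}
```
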
Summary: When compiling with -Werror=shadow-compatible-local, a variable name cannot be reused. This passed our tests, but some people compile with stronger settings.

Differential Revision: D5440805

fbshipit-source-id: a246af748717fb7e0e7a321e1ac4ddfef68ae524
…igned buffers

Summary: When performing reductions on fp16 buffers, gloo assumed that both buffers were either aligned to 32 bytes or misaligned by the same offset. This may not hold in intermediate steps of halving-doubling allreduce, when the reduction is performed on some offset within the receive buffer. The fix is to use intrinsic instructions that work with unaligned pointers.

Reviewed By: akyrola

Differential Revision: D5450103

fbshipit-source-id: 9a1c8f8c34d2e62223f6d5c21573ea1cfad6537f
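
The fix boils down to using the unaligned load/store intrinsics, roughly as below (requires F16C/AVX; illustrative, not the exact gloo kernel):

```
#include <cstdint>
#include <immintrin.h>

// Sum 8 half-precision values from y into x using unaligned loads/stores,
// so neither pointer has to be 16/32-byte aligned.
void halfSum8(uint16_t* x, const uint16_t* y) {
  __m128i xh = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
  __m128i yh = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
  __m256 xs = _mm256_cvtph_ps(xh);  // fp16 -> fp32
  __m256 ys = _mm256_cvtph_ps(yh);
  __m256 sum = _mm256_add_ps(xs, ys);
  __m128i out = _mm256_cvtps_ph(sum, _MM_FROUND_TO_NEAREST_INT);
  _mm_storeu_si128(reinterpret_cast<__m128i*>(x), out);
}
```
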
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
…cedecc'

git-subtree-dir: torch/lib/gloo
git-subtree-mainline: 4a4d884
git-subtree-split: 1978bba
@soumith soumith closed this Jul 25, 2017
@soumith soumith deleted the gloocontbuild branch July 25, 2017 02:04
houseroad added a commit to houseroad/pytorch that referenced this pull request Jul 25, 2019
…9a6052

Summary:
Previous import was 707064980b9825b8705b9d1c9aad34d8b022d5dd

Included changes:
- **[28ca699b](onnx/onnx@28ca699b)**: Member Company logo guidelines (pytorch#2196) <Prasanth Pulavarthi>
- **[47acb06a](onnx/onnx@47acb06a)**: remove link to outdated issue for contributions wanted (pytorch#2186) <Prasanth Pulavarthi>
- **[168519f6](onnx/onnx@168519f6)**: Create sigs.md (pytorch#2103) <Prasanth Pulavarthi>
- **[b9320746](onnx/onnx@b9320746)**: mintor format update (pytorch#2180) <Prasanth Pulavarthi>
- **[65b8e0f9](onnx/onnx@65b8e0f9)**: add more types support for Equal op (pytorch#2176) <Ke Zhang>
- **[dc5e62a9](onnx/onnx@dc5e62a9)**: Update AddNewOP document. (pytorch#2172) <Emad Barsoum>
- **[bae8b530](onnx/onnx@bae8b530)**: Add missing space (pytorch#2150) <Takeshi Watanabe>
- **[5952b7f5](onnx/onnx@5952b7f5)**: python api example typo fix (pytorch#2155) <LeicongLi>
- **[904cb842](onnx/onnx@904cb842)**: Fix errors in RoiAlign shape inference code (pytorch#2167) <G. Ramalingam>

Differential Revision: D16502373

fbshipit-source-id: 68b9479a30fc330d876947cb4ea8227848f576e3